
APFS is not safe to use with names which have Unicode normalisation issues - ingve
https://eclecticlight.co/2017/04/06/apfs-is-currently-unusable-with-most-non-english-languages/
======
cstross
The real problem isn't whether filename normalization is a good/bad thing, the
problem is _Apple used to do it one way and is now switching without warning
to doing it the other way_.

It's the logical end product of the odyssey from Apple's original philosophy
of a resource and data fork model for files to the UNIX stream-of-bytes model
for files. The UNIX model traditionally kept metadata about files separate
(anyone else remember naughtily hand-editing directory files with sed(1), back
in the day?) but gradually picked up a bunch of new stuff that has to be
stored _somewhere_. Meanwhile, the Mac platform adopted UNIX binaries with the
move to NeXTStep underpinnings (we're going back to the late 90s here, and the
adoption of OSX over traditional MacOS), obviating the need for the original
resource bundle, which got rid of those annoying errors about missing bundle
bits but left us with a legacy of .DS_Store turdlets in every directory to
hold file metadata that was formerly stored in the resource fork.

As a destination it's laudable — design consistency is almost always laudable
— but it's the end goal of a very messy process and it looks like APFS isn't
quite there yet; NSFileManager and NSURL need some way of distinguishing files
with different Unicode representations, and the Finder in particular needs to
be robust. I'm guessing this is going to be fixed when High Sierra finally
ships, but isn't supported in Sierra at present, hence the OP's alarm at the
situation.

~~~
mchanson
"without warning"? Not really. This has been discussed to death on ATP last
year.

Also, in regards to the headline, there are tens or maybe over a hundred
million non-english speakers using iOS already running APFS...

~~~
cryptonector
Did anyone mention normalization-preserving/normalization-insensitive
behavior? That's what ZFS does, and it's great!

~~~
kstrauser
> Did anyone mention normalization-preserving/normalization-insensitive
> behavior?

You did, 11 times so far in this thread. I counted.

------
acdha
There's a potentially useful discussion to be had on normalization but the
title is pure clickbait hyperbole. HFS+ is the only filesystem in common use
which performs Unicode normalization and a statement that bold would require
at least some evidence that Windows, Linux, etc. are only usable by English
speakers.

My position on this is mixed. I've had to write code to deal with
normalization changes in archives and it's quite tedious. The HFS+ approach
to normalization is in many ways the best choice considered in isolation,
but in practice it's really expensive to support, since most filesystems,
tools, APIs, etc. predate Unicode and everyone else chose the bag-of-bytes
approach.

~~~
catwell
Normalization is very expensive and does not belong at the FS level. This
kills performance for some classes of applications.

The comparison with other filesystems does not hold since applications for
other OSs have always been developed with no normalization at FS level, and
hence it was done by the applications, or through the use of high-level OS
APIs. Mac applications, on the other hand, expect it to be the responsibility
of the filesystem. This is explained pretty clearly in the article.

On a side note, I wish Apple had taken this opportunity to switch
normalization from NFD to NFC, which basically everything else uses. The
distinction causes complexity and frequent issues in software that shares
data between Apple platforms and other platforms, such as version control
systems.
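
If you want to see the distinction concretely, here's a quick Python
sketch (Python's unicodedata implements the standard forms; the NFD bytes
are the same ones the article reports for HFS+):

    import unicodedata

    nfc = unicodedata.normalize("NFC", "é")  # one code point, U+00E9
    nfd = unicodedata.normalize("NFD", "é")  # 'e' plus U+0301 combining acute
    print(nfc.encode("utf-8").hex())  # c3a9 (two bytes)
    print(nfd.encode("utf-8").hex())  # 65cc81 (three bytes)
    print(nfc == nfd)                 # False: same text, different bytes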

EDIT: according to pilif's comment, they did, which is awesome!

~~~
acdha
Do you have any recent benchmarks showing a significant impact from
normalization? I haven't seen that on anything in at least a decade and that
was simply Red Hat shipping an ancient and completely unoptimized libicu.

> Mac applications, on the other hand, expect it to be the responsibility of
> the filesystem. This is explained pretty clearly in the article.

Again, the article made a huge sweeping claim without supporting it. That's
simply not true either way – many apps on every OS don't handle this at
all, some handle it consistently everywhere, and what a “Mac application”
means varies widely from “clean, modern Cocoa” to “uses a lot of C, etc.
libraries”, “cross platform C++”, “C# port”, “Electron shell”, etc. You can't
make any statement which is true for every single one of those categories,
much less for every code path which eventually results in a filesystem call.
I've run into cases where something mostly worked until you hit their
integrated ZIP, Git/SVN, etc. support and found a new way that a filename was
constructed.

My point wasn't that everything is fine but simply that this is complicated
and no decision results in avoiding problems. Not normalizing allows for
confusing visually-identical files; normalizing results in errors or data loss
which will be blamed on the OS.

~~~
vintagedave
> Again, the article made a huge sweeping claim without supporting it.

Looks to me like the article made claims and backed them up with examples and
screenshots.

Rather than saying the article is wrong, can you demonstrate /why/ it is
wrong? Using its examples and concerns (Finder, console, and scripts)?

~~~
acdha
> Rather than saying the article is wrong, can you demonstrate /why/ it is
> wrong?

I think you might want to re-read my entire comment: note that I'm not arguing that
the technical details are wrong, only that they're insufficient to support the
huge “APFS is unusable” conclusion.

As previously noted, Windows and Linux work the same way and they are used by
more people in individual non-English locales than the total number of Mac
users. Would you say “NTFS is unusable by non-English users” is a useful
statement?

There's plenty of room to say that a particular tool needs improvement, or
that people making systems which copy or archive files should check for
pathological cases, but it doesn't help anything to overstate the case so
broadly.

~~~
deong
The issue isn't a "bag-of-bytes" filename model. The issue is a "bag-of-bytes"
filename model combined with an inconsistent normalization scheme.

It's not a problem on Windows or Linux filesystems because Windows and Linux
don't provide a half-assed normalization scheme that lets me fairly easily
create files that can't be accessed. If the Cocoa libraries did no
normalization, then the resulting behavior might be obnoxious from a human-
interface perspective, but I don't think the article would describe it as
"little short of catastrophic".

I'm sitting here on my US English keyboard typing scancodes that look just
like they did in 1990, so I'm not the best authority on how big of a problem
it really is, but I'd guess it's going to result in a lot of bugs. Anyone
who's ever tried to use a Mac with a case-sensitive HFS+ partition should be
able to tell you that programmers can't even "normalize" their filenames
consistently even within their own native language.

~~~
acdha
> It's not a problem on Windows or Linux filesystems because Windows and Linux
> don't provide a half-assed normalization scheme that lets me fairly easily
> create files that can't be accessed

This is only true if you're talking about the kernel APIs. Unfortunately,
filenames come from a variety of sources and it's easy to find tools which
inconsistently normalize them – e.g. simply copying and pasting a name from a
Word doc, web page, etc. which has different normalization than whatever
originally created the file – or which produce either duplicate error messages
or confusing error messages because the normalization form used in a file
doesn't match the normalization form written on disk.

I've encountered variations of this problem on all three systems. No approach
is going to handle 100% of the filenames in the wild and all of them will
require extra care in the user-interface which may or may not have been done –
e.g. the Windows Explorer still provides no way to tell why Café.txt and
Café.txt are not the same file – and fixing the cases where programs are
internally inconsistent. APFS switching will expose some programs which were
unsafe before but since it's consistent with the other common filesystems
it'll remove the need for every archive, version control, etc. system to
either special-case or break.
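
For the curious, the Café.txt situation is easy to reproduce on any
filesystem that stores names verbatim (a Python sketch; on HFS+ the second
create would land on the same file instead):

    import os
    import unicodedata

    nfc_name = unicodedata.normalize("NFC", "Café.txt")
    nfd_name = unicodedata.normalize("NFD", "Café.txt")

    # On a bag-of-bytes filesystem these are two distinct files whose
    # names render identically in most UIs.
    for name in (nfc_name, nfd_name):
        with open(name, "w"):
            pass

    print(sorted(os.listdir(".")))  # two visually identical entries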

------
pilif
HFS+ was the only file system I know of that did Unicode normalisation,
and it was certainly the only one choosing NFD, which encodes characters in
a way that's impossible for someone to easily type in the UI.

Short-term this will cause a lot of inconsistencies for applications using
low-level APIs, as the files that currently exist will be NFD-normalised,
but any user-given path will very likely be more or less equivalent to what
NFC would produce.

So in the short term, this will be a mess (unless the APFS conversion at
install time also does a one-time conversion of the normal form), but in
the long term, I believe this is the way to go.

And for that matter, the same should happen with regard to
case-insensitivity, which is even worse, as case-insensitivity rules can
actually depend on locale, which means that depending on the user's locale
two file names might or might not be identical under case-insensitivity
rules.

Unfortunately, it looks like there's just too much legacy code around that
plain doesn't work with a case-sensitive file system on the Mac (I'm
looking at you, Adobe).

BTW: On my 10.13 Beta 1 setup with an APFS converted boot drive, unless I
manually create a NFD encoded file name on the command line (which you can't
do accidentally), everything is NFC both in the UI and on the command line.
This also applies to files I haven't touched since the conversion to APFS.

~~~
xoa
> _HFS+ was the only file system I know of that was doing Unicode
> normalisation_

I've seen something in a couple comments now to this effect, does nobody here
do anything with ZFS at all? It's a pretty great filesystem on not just
illumos but FreeBSD, Linux, and macOS, and it has full native normalization
support (and for all 4 forms too). Normalizing is optional, but it's
definitely there and with Macs I have been using NFD for compatibility
purposes for 5 years now with no discernible performance problems in day-to-
day use. I know ZFS isn't remotely a majority but I don't think it's _that_
obscure either.

~~~
cryptonector
Thank you for pointing this out.

In particular ZFS has normalization-preserving/normalization-insensitive
behavior, which is far superior to HFS+'s opinionated normalization-on-create
(to a form that is different from the common input modes' output!).

~~~
jorangreef
Yes, HFS+ implements normalization-insensitive behavior through not being
normalization-preserving, in the same way that some FATs might implement case-
insensitive behavior through not being case-preserving. They didn't realize
that you could have both features: normalization-preserving and normalization-
insensitive.

~~~
cryptonector
I guess it was a very forgivable lack of imagination. When we came up with
form-preserving/insensitive we were motivated in great part by the interop
mess caused by HFS+ -- we might not have arrived there without that mess,
though I'd like to believe that someone would eventually have reached this
conclusion regardless.

------
ioquatix
The title is click-bait and over-dramatises the issue.

The choice in APFS is that a filename is a sequence of bytes. Nothing more,
nothing less (feel free to correct me if I'm wrong here).

If you want to see the kind of issues that path normalisation brings, check
out this: [https://github.com/thibaudgg/rb-
fsevent/blob/master/ext/fsev...](https://github.com/thibaudgg/rb-
fsevent/blob/master/ext/fsevent_watch/FSEventsFix.c)

I'd like to believe that most developers would prefer the current behaviour
over that proposed by the article (normalisation).

If there are issues in Apple's Finder or other high level APIs, I'd like to
believe they will be fixed before the final release. These are fixable bugs,
IMHO.

~~~
kalleboo
The problem seems to come from the high-level APIs doing normalization. So
if you have two files in a directory, one normalized and one not, and you
open the non-normalized file, the high-level API will normalize that
filename and you end up opening the wrong file.

I've always disliked the practice of messing around with file paths
(storing them, concatenating them, etc). I preferred the way Classic MacOS
typically dealt with filesystem references and aliases instead of file
paths. This also meant users could rename or move files and references
would still work.
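
A toy simulation of that failure mode in Python (the dict stands in for a
directory on a verbatim-storing filesystem, and the normalize() call for
what a normalizing high-level API does to the caller's path):

    import unicodedata

    # Two distinct on-disk entries whose names render identically:
    directory = {
        unicodedata.normalize("NFC", "café.txt"): "file A",
        unicodedata.normalize("NFD", "café.txt"): "file B",
    }

    def high_level_open(name):
        # The API rewrites the caller's name before the lookup...
        return directory[unicodedata.normalize("NFC", name)]

    wanted = unicodedata.normalize("NFD", "café.txt")
    print(high_level_open(wanted))  # "file A", not the file B we asked for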

~~~
rndgermandude
Having two sets of APIs, one that does not mess with paths and one that
does or adds other constraints, will cause all kinds of funky problems. For
APFS it will be applications using the high-level APIs not being able to
open certain files, or opening the wrong file[0], as stated.

You get a similar situation with NTFS, where the file system supports long
paths while the userland WinAPI uses a much smaller length limit (unless
you jump through "\\?\" hoops, which don't work for relative paths), which
renders certain files "unopenable" by certain applications.

[0] Regarding opening the wrong files, anybody remember the Android zip
vulnerabilities?
[https://googlesystem.blogspot.com/2013/07/the-8219321-androi...](https://googlesystem.blogspot.com/2013/07/the-8219321-android-
bug.html) It's not that hard to imagine that some macOS software does
security/sanity checks on files using the low-level non-normalizing API but
then opens the (wrong and unchecked) file later with the normalizing
high-level API, or vice versa. Having this API discrepancy built into your
OS certainly makes these kinds of things more likely.

------
cryptonector
I've been saying this for years: [http://cryptonector.com/2010/04/on-unicode-
normalization-or-...](http://cryptonector.com/2010/04/on-unicode-
normalization-or-why-normalization-insensitivity-should-be-rule/) (originally
at blogs.sun.com, now blogs.oracle.com, though I can't find it there).

The problem is that most input methods produce something close to NFC while
HFS+ decomposes to something close to NFD. Which means that if you
cut-n-paste non-ASCII Unicode names from the Finder into any app that
doesn't normalize, then you'll have problems.

The solution we came up with for ZFS was normalization-preserving but
normalization-insensitive name comparison and directory hashing. This produces
the best interoperability via NFS, SMB, WebDAV, and so on, and local POSIX
access.
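
The idea is simple enough to sketch in a few lines of Python (the concept
only, not ZFS's actual directory hashing):

    import unicodedata

    class Directory:
        """Normalization-preserving, normalization-insensitive names."""

        def __init__(self):
            self._entries = {}  # normalized key -> (verbatim name, data)

        def create(self, name, data):
            key = unicodedata.normalize("NFC", name)
            if key in self._entries:
                raise FileExistsError(name)
            self._entries[key] = (name, data)  # original spelling kept

        def lookup(self, name):
            return self._entries[unicodedata.normalize("NFC", name)]

Create a name with an NFD spelling and a lookup with the NFC spelling
finds it, while a listing still returns exactly the bytes the creator
used.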

~~~
konstmonst
I also think encoding doesn't belong in a file system. Let the names be
arrays of bytes and leave the encoding to the people that use it, be it
UTF-8, UTF-16 or something entirely different.

~~~
rtpg
The problem that arrives there almost instantly is that people want to see
a list of filenames.

If you're ever going to do some sort of "displaying" of data, you cannot
treat it as opaque bytes. You need to know what characters things are
supposed to be presented as.

You could imagine not settling on a specific encoding, but you must know
the encoding. Unless your plan is to show a list of numbers to users.
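
The classic illustration in Python (any pair of mismatched encodings will
do):

    raw = "café".encode("utf-8")   # the bytes some program wrote
    print(raw.decode("latin-1"))   # 'cafÃ©': guessed wrong, got mojibake
    print(raw.decode("utf-8"))     # 'café': you have to know it's UTF-8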

~~~
thaumasiotes
The filesystem can't be responsible for displaying anything, though.
Displaying filenames is the job of the shell / window manager / etc.

The filesystem is much better off handling filenames as a number of arbitrary
bytes. Let people who want to put weird bytes in their filenames see ugly
filenames along the lines of "\x00 Can you see this?"

~~~
rtpg
OK so I move a file from my hard drive to a USB drive, and give it to a
friend.

Can they not read the filenames anymore because their shell/window manager is
different?

~~~
cryptonector
Right!

Or worse, you speak multiple languages, or learn new ones, and need to...
switch codesets? How do you then access your old files?!

We must switch to Unicode. Full stop. If there are imperfections in Unicode
script support, then we must fix those, but otherwise we must adopt Unicode.

------
_jomo
The actual problem is not APFS but Apple's programs doing normalization
(and Apple's advice to developers to do the same). It wouldn't be a problem
if programs just used the file names given to them.

In fact, I think the normalization HFS+ does is more problematic. For example,
fish shell can't complete file names when you use un-normalized characters in
the input. [0][1]

Edit:

0: "Unicode normalization issues with HFS+" [https://github.com/fish-
shell/fish-shell/issues/474](https://github.com/fish-shell/fish-
shell/issues/474)

1: "Completion does not work for special characters" [https://github.com/fish-
shell/fish-shell/issues/1794](https://github.com/fish-shell/fish-
shell/issues/1794)

~~~
cryptonector
[http://cryptonector.com/2010/04/on-unicode-normalization-
or-...](http://cryptonector.com/2010/04/on-unicode-normalization-or-why-
normalization-insensitivity-should-be-rule/)

------
markonen
Apparently on iOS 11, even the case-sensitive variant of APFS will be
normalization-insensitive. Previously it looked like this would only be the
case on macOS's case-insensitive APFS variant.

Anyway, it looks like the issues raised in this (April) blog post will not
actually apply to iOS 11 or macOS High Sierra.

~~~
cryptonector
Oh, that's great!

I've been saying for years, to anyone who will listen (and many who won't!)
that normalization-preserving/normalization-insensitive is the only sensible
behavior for a filesystem.

(I've said this many times in the IETF, in the context of NFSv4 and WebDAV
and such, every time I've noticed the subject of stringprep for filesystem
protocols come up.)

[http://cryptonector.com/2010/04/on-unicode-normalization-
or-...](http://cryptonector.com/2010/04/on-unicode-normalization-or-why-
normalization-insensitivity-should-be-rule/)

------
kuon
I'm happy that the filesystem treats names as sequences of bytes.
Normalization should happen at a higher level.

~~~
deong
"Happening at a higher level" is a reasonable solution, if it happens
_consistently_ , no matter which higher level you're using. If you have 18
different functions to open a file, and 11 of them normalize and 7 don't, then
you're screwed before you even get out the door. Programmers simply aren't
capable of getting this right in a consistent way if asked to solve the
problem application by application.

~~~
kuon
Higher level doesn't mean in the app.

What I mean is that Unicode normalization is really hard, and it should be
its own module that can be used regardless of the FS.

app->fopen->unicode normalization->APFS/HFS/FAT...

~~~
deong
In that case, I agree. The problem here seems, from my understanding, to be
that what Apple did was more like

    app -> CFFile -> unicode normalization -> fopen -> APFS

which screws you because anyone can just call fopen on their own without using
the core foundation libraries, leading to inconsistent states in the
filesystem. You can be higher level than the filesystem, but only a little
bit. You can't be higher level than some API that developers will regularly
use (unless you do like ZFS and normalize at lookup rather than create).

~~~
kuon
I guess the ZFS approach is the best one.

------
robjwells
From the "What's new for developers" in macOS 10.13 High Sierra document [0],
case-sensitive APFS can be normalisation-insensitive:

> APFS now supports an on-disk format change to allow for normalization-
> insensitive Case Sensitive volumes. This means that file names in either
> Unicode NFC or NFD will point to the same files.

Which means that both versions support normalisation-insensitivity.

(Edit: There is also a one-line mention of this in the iOS 11 document [1] but
it doesn't say if it is the default.)

[0]:
[https://developer.apple.com/library/content/releasenotes/Mac...](https://developer.apple.com/library/content/releasenotes/MacOSX/WhatsNewInOSX/Articles/macOS_10_13_0.html)

[1]:
[https://developer.apple.com/library/content/releasenotes/Mac...](https://developer.apple.com/library/content/releasenotes/MacOSX/WhatsNewInOSX/Articles/macOS_10_13_0.html)

~~~
robjwells
The link to the What's New in iOS 11 page is wrong; it's here:
[https://developer.apple.com/library/content/releasenotes/Gen...](https://developer.apple.com/library/content/releasenotes/General/WhatsNewIniOS/Articles/iOS_11_0.html)

APFS is mentioned at the very end.

------
asimpletune
"Unusable" is a strong word to use. Should filesystems be making up for our
Unicode shortcomings? From a SW design perspective, is that the most sensible
place to pass the burden of responsibility? I would say that another way to
handle it is to store a file name as an array of bytes and put the burden on
software developers to interpret Unicode correctly. Swift does this pretty
nicely.

I would say the only downside to this approach, is a user wouldn't be able to
distinguish two files with the same name apart, but it's hard to imagine how
they'd get to creating such a situation in the first place without the
developer rule above being violated.

~~~
deathanatos
> _Should filesystems be making up for our Unicode shortcomings?_

Absolutely, yes. File _names_ are text by their very definition; that we've
been treating them as "bags of bytes" is a historical tragedy. At the very
least, file names need to be displayed, as text, to the user, so they should
be stored _as text_, that is, in some well-defined encoding, and yes, it
should be the job of the filesystem driver / kernel to enforce that it's not
writing garbage out to disk.

Reinventing that wheel in every system that in any way interacts with the
filesystem is bad engineering, and doomed to fail.

Further, I don't see why the typical user should need to know or understand
the differences between 'e\N{COMBINING ACUTE ACCENT}' and
'\N{LATIN SMALL LETTER E WITH ACUTE}'. Likewise, I don't see why each and
every piece of code
should be forced to handle that. Developers _will_ get this wrong. In fact,
the article seems to say even Apple can't get it right, in that Finder will
not correctly show the directory contents in some instances, and fails to open
files in some instances, telling users the file "doesn't exist".

~~~
konstmonst
But what is text? Not everyone wants to use Unicode. It is dependent on the
platform, the region, the OS and on many other things like LC_* variables
on Linux. Why should a filesystem depend on those too?

~~~
deathanatos
> _and on many other different things like LC_* variables on linux_

The point is that it _doesn't need to be_. I would entertain the idea that
not everyone wants to use Unicode: in that case, the FS should _still_
have a well-defined encoding, such that I can still arrive at a string to
display to the user. The point is not that "Unicode is best" but that storing
file names as "bags of bytes" is incredibly user unfriendly, and there needs
to be a straightforward, no bullshit method to display and transmit the names
of files.

But I would also argue that Unicode is the best we've got presently, and it
would be pragmatic for a filesystem to simply adopt it outright. It's
_overwhelmingly_ the dominant character set in use today, especially if you
ignore deprecated junk that Unicode is a strict superset of.

------
loeg
> Currently, the Mac Extended file system, HFS+, uses Normalisation Form D
> (NFD). Under that, é and é are automagically converted to é, and
> represented as three bytes, 65 cc 81.

This is actually an oversimplification. Mac NFD does not decompose characters
in a few specific ranges[0]:

> U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF
> are not decomposed

But I do expect applications built for HFS+' pseudo-NFD to struggle with non-
normalizing APFS.

[0]:
[https://developer.apple.com/library/content/qa/qa1173/_index...](https://developer.apple.com/library/content/qa/qa1173/_index.html)
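
Python's unicodedata implements standard NFD, which does decompose
characters in those ranges when they have canonical mappings; that is
exactly where Apple's variant diverges:

    import unicodedata

    # U+2126 OHM SIGN (inside U+2000-U+2FFF) canonically maps to U+03A9.
    # Standard NFD rewrites it; Apple's HFS+ variant leaves it alone.
    print(unicodedata.normalize("NFD", "\u2126"))  # 'Ω' (U+03A9)

    # U+F900, a CJK compatibility ideograph, canonically maps to U+8C48.
    print(unicodedata.normalize("NFD", "\uf900"))  # '豈' (U+8C48)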

------
foota
It would be interesting to know why Apple decided not to normalize at the
file system level, then. Just an argument based on separation of concerns?

~~~
vetinari
Normalizing at the file system level was a mistake in the past. Normalizing
Unicode is not an easy task; it belongs in userspace, not in a kernel
driver.

Other systems consider a filename a string of octets, and the
interpretation of what those octets mean is left for userspace to decide.

For Linus Torvalds' colorful opinions on HFS+, see here:
[https://plus.google.com/+JunioCHamano/posts/1Bpaj3e3Rru](https://plus.google.com/+JunioCHamano/posts/1Bpaj3e3Rru)
(in the comments).

~~~
rurban
Others have different opinions. Normalization of Unicode identifiers is
necessary for identifiers to be identifiable. Mostly this is only done for
domain names; I also do it for identifiers in my programming language, and
consequently path names and directory entries should also be identifiable.
The garbage-in, garbage-out camp completely ignores all Unicode security
concerns and simply shifts blame to user-space. I applauded Apple for using
normalization in HFS+, but the APFS integration is just horrible: first, it
makes Unicode paths insecure again, and second, it's inconsistent.

NFD over NFC is simply trading data for CPU. NFC requires 3 passes over a
string, NFD only 2, whilst NFD is usually a few bytes longer. Linus talking
about NFD corrupting data is of course nonsense. He should stay technical
and only rant about things he has an idea about. The python3 NFKC
normalization format is nonsense, but not really important. NFD is fine,
because it's faster.

Shifting interpretation of byte encodings from utf-8 to user-space is
typical Linux nonsense, but he inherited the existing mess. Using utf-8 is
miles better than bytes.

Case-insensitivity in HFS+ is of course legacy nonsense. This should have
been the target to get rid of, not normalization.

The second big security issue would be to forbid mixed scripts in a name
(filename). I blogged about it here, and OP uses the same examples I used:
[http://perl11.org/blog/unicode-
identifiers.html](http://perl11.org/blog/unicode-identifiers.html) Invisible
combining marks, right-to-left overrides and similar spoofing security
problems are only fixable with normalization and more of TR31 (mixed
scripts, confusables), though confusables can only be handled in
user-space, not the FS.

------
needusername
This is going to be fun with the java.nio.file API. Java currently normalises
everything to NFC even though the file system on macOS is NFD. This was
introduced some time in Java 8 because reasons.

Of course this is all going to break now, and of course it will take years
until Oracle fixes it.

------
therealmarv
So, can somebody answer this: is the following still right on APFS or not?

    rsync -rltv --iconv=utf-8-mac,utf-8 from_a to_b

and similar sshfs:
[https://github.com/osxfuse/sshfs/issues/14#issuecomment-1859...](https://github.com/osxfuse/sshfs/issues/14#issuecomment-185913726)

I need to use those iconv options every time when dealing with umlauts and
mac<->linux communication. So do I still need that on APFS or not?

~~~
loeg
Per pilif[0], NFD filenames are converted to NFC during HFS+ to APFS
conversion:

> BTW: On my 10.13 Beta 1 setup with an APFS converted boot drive, unless I
> manually create a NFD encoded file name on the command line (which you can't
> do accidentally), everything is NFC both in the UI and on the command line.
> This also applies to files I haven't touched since the conversion to APFS.

[0]:
[https://news.ycombinator.com/item?id=14496538](https://news.ycombinator.com/item?id=14496538)

------
the_mitsuhiko
Sadly this was expected, but it's not clear what Apple wants developers to
do about it now. I am very happy that normalization was removed from the FS
layer, but I wish they had removed it from Cocoa as well, or at least
picked a different normalization form.

The fact that Cocoa apps create different filenames than the terminal is
not great. That some UI layers (like the Finder) seem to cause even more
confusion is not helping either.

------
alkonaut
Wait, why is a file name a sequence of bytes? Why is it not a sequence of
bytes that forms a valid string in some encoding, so that you can just
refuse some file names? Something already refuses invalid characters such
as invisible characters or "/"; wouldn't that same place be a good place to
refuse anything that doesn't normalize to the same byte sequence, or an
even stricter subset?
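
A create-time policy like that is easy to sketch (hypothetical rules for
illustration, not any real filesystem's):

    import unicodedata

    def acceptable_filename(raw: bytes) -> bool:
        """Valid UTF-8, no separators or control characters, and
        normalization-stable (already in NFC)."""
        try:
            name = raw.decode("utf-8")
        except UnicodeDecodeError:
            return False
        if "/" in name:
            return False
        if any(unicodedata.category(c) == "Cc" for c in name):
            return False  # rejects NUL and other control characters
        return name == unicodedata.normalize("NFC", name)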

~~~
cryptonector
Supporting more than one encoding requires knowing which encoding each name
uses, and apps and APIs are very bad at keeping track of that. Which is why
only Unicode, in some UTF, makes sense.

ZFS allows you to store non-UTF-8 if you like, but for all valid UTF-8 it
implements normalization-preserving/insensitive behavior, which is the best
possible compromise (IMO).

------
nottorp
To be honest, I blame Unicode. Why allow different representations for the
same character, and then provide a normalized form anyway, except it's not one
normalized form but several? Sounds like job security to me.

~~~
Piskvorrr
Blame the encodings that came before it: Latin-1 and Windows-1252, and all the
other ones. Hindsight is always 20/20, y'know.

~~~
jrimbault
I'm just being sarcastic: should we also blame the computers of old for not
being able to handle more than 255 characters?

~~~
cryptonector
No, just developers. Of course, the use of bytes for characters goes back a
long time, to times when computers had small memories and disk (and other)
storage capacities. And to even before then, to the days of telexes and
typewriters. It's completely understandable. But UTF-8 is genius, which is why
we use it.

Incidentally, ASCII was actually a multi-byte codeset... since one could
combine most lower-case characters with BS (backspace) and overstrike with
apostrophe, backtick, tilde, comma (for cedilles), and double-quotes (for
umlauts), or with the same char for bold, or with underscore for underline.
nroff(1) still uses this for bold and underscore, no?
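
The convention still works if you run the output through something that
honors overstrike, such as less(1) or ul(1); a small Python sketch of the
nroff style:

    # X BS X renders as bold and _ BS X as underline in pagers that honor
    # overstrike (most terminal emulators just ignore the BS).
    bold = "".join(c + "\b" + c for c in "bold")
    underline = "".join("_\b" + c for c in "under")
    print(bold)
    print(underline)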

~~~
thaumasiotes
I've always wondered what the concept of a "backspace" character was supposed
to mean. (I tried printing it to erase a previously printed character, but
that doesn't work.)

I guess it's another relic of the idea that computer output goes to a printer
rather than a display?

~~~
cryptonector
Yes! Terminal vendors had to explicitly support this in terminals, though
maybe not just because of printers but because it's actually quite useful.

Note that BS only produces overstrike when followed by certain characters
(which we might term "combining" for the fun of it), while most will just
change the character at that location. A tty spinner is just |BS/BS-
BS\BS|BS... with some delay between each -- no overstrike there.

------
jorangreef
For tips on working with different filesystems, especially filesystems that
have different approaches to normalization, see an article I wrote:

[https://nodejs.org/en/docs/guides/working-with-different-
fil...](https://nodejs.org/en/docs/guides/working-with-different-filesystems/)

The gist is that normalization should only ever be used for comparison (if
needed, e.g. "do these two files have filenames that would look the same to
the user?"), and never for changing data (filenames are user data and
should be
stored verbatim without normalization). HFS+ should never have used
normalization in the first place. You can think of normalization as
essentially a lossy hash function (you cannot get back to HFS+ NFD form once
you normalize it to NFC - HFS+ NFD and NFD proper are not the same thing -
many people don't realize this). Using normalization for anything other than
temporary comparison leads to data loss.
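
The lossiness is easy to demonstrate with a singleton like U+212B ANGSTROM
SIGN, which no normalization form will ever give back:

    import unicodedata

    original = "\u212b"  # ANGSTROM SIGN
    nfc = unicodedata.normalize("NFC", original)
    print(hex(ord(nfc)))    # 0xc5: U+00C5, not U+212B
    print(nfc == original)  # False: the original spelling is gone for good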

------
kccqzy
Did anyone not read the documentation? I'm pretty sure this will be a non-
issue soon.

> How does Apple File System handle filenames?

> APFS has case-sensitive and case-insensitive variants. The case-insensitive
> variant of APFS is normalization-preserving, but not normalization-
> sensitive. The case-sensitive variant of APFS is both normalization-
> preserving and normalization-sensitive. Filenames in APFS are encoded in
> UTF-8 and aren’t normalized. […]

> The first developer preview of APFS, made available in macOS Sierra in June
> 2016, offered only the case-sensitive variant.

------
netheril96
Do "most non-English" languages have normalization issues? At least CJK users
do not.

~~~
rurban
Actually Korean (Hangul) is the biggest user of NFD normalization, with
very special logic. Hangul also has the only 2 still-remaining identifier
bugs in the Unicode 9 database: HANGUL FILLER and HALFWIDTH HANGUL FILLER
are wrongly valid ID_Cont, and as such wrongly usable as identifier
characters (such as in filenames).

Almost every script but Latin-1 has TR31, TR36 and TR39 issues.
[http://www.unicode.org/reports/tr39/](http://www.unicode.org/reports/tr39/)

~~~
cryptonector
Hangul has composed and decomposed forms, but only the decomposed forms are
used as canonical, even in NFC.

(Incidentally, NFC is closed to new compositions, though new compositions can
be added to Unicode.)

------
tehabe
Personally I think the current Normalisation Form D is awful: storing an ü
as two characters is really annoying, and even bash can't really deal with
it in the version Apple ships. I really hope APFS will fix this. But we'll
see.

~~~
cryptonector
It's awful in great part because common input modes produce something close to
NFC.

~~~
kps
Apple could of course change their keyboard layouts to produce something close
to NFD. (OS X has always allowed keys to produce multi-character results.)

~~~
cryptonector
But everyone else's input modes would still produce something close to NFC.
Interop with the rest of the world matters. OS X is not the only OS.

------
callesgg
I don't expect my computer to treat characters that look the same equally.

~~~
cryptonector
Even when they are the same character according to Unicode?!

(I understand not treating confusable characters as the same. That's a
different story.)

~~~
callesgg
No, in my mind a filename should be an identifier made of a bunch of bytes.
How you represent those bytes is not important in the context of the file
subsystem.

For me it seems more like a problem with Unicode. I can see why it is the
way it is from a certain perspective. Very convenient.

But it has broken the underlying abstraction layer.

~~~
cryptonector
There was no underlying layer, just lots of U.S.- or Western-centric
assumptions. Unicode breaks them, but so did every codeset (Shift-JIS, this,
that, and the other).

------
cjensen
A pet peeve of mine is the use of the word "unusable" when reporting a
behavior which the reporter finds undesirable. Apple is just now doing what
Linux has always done. Is Linux unusable?

Keeping your issue reports neutral and avoiding hyperbole is an underrated
skill.

------
burnbabyburn
I really fail to see why you would want your file system to normalize UTF-8
characters with HFS+ rules.

~~~
loeg
Normalizing to NFC form would make some sense, and was conceivably an option.

~~~
cryptonector
Normalization-preserving/normalization-insensitive behavior makes even more
sense!

------
INTPenis
I'm glad this is getting attention. I've been working around this issue for
over a year now, assuming that "if it's broke, Apple will know, and they will
fix it".

Apparently that was a wrong assumption.

------
bsaul
I guess we'll see a new option in the Finder, "show non-normalized files",
next to "show hidden files", pretty soon. I don't see how they could
automatically solve all the potential mismatch issues otherwise.

~~~
cryptonector
ls | od -c?

