
APFS does not normalize Unicode filenames - okket
http://mjtsai.com/blog/2017/03/24/apfss-bag-of-bytes-filenames/
======
userbinator
I agree that this is a good change. Unicode, normalisation, character
encodings, etc. should really be handled at the presentation layer, and
everything below that just treats filenames as sequences of bytes, perhaps
with one or two exceptions like '/' and \0.

It is interesting to consider a theoretical system in which paths are
represented in 0-terminated count-length format (e.g. "foo/bar/baz/myfile.txt"
would be "\003foo\003bar\003baz\012myfile.txt\000"), truly allowing _any_ byte
in a filesystem node's name, although that might be going a little bit too
far.
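
A minimal Python sketch of the count-length idea described above, for illustration only. The helper names `encode_path` and `decode_path` are hypothetical, and the one-byte length prefix (limiting each component to 255 bytes) is an assumption, not something any real filesystem is claimed to do.

```python
def encode_path(components):
    """Encode byte-string components as <length><bytes>... ending with a 0 byte."""
    out = bytearray()
    for name in components:
        # Assumption: a one-byte length prefix, so 1..255 bytes per component.
        if not 1 <= len(name) <= 255:
            raise ValueError("each component must be 1..255 bytes long")
        out.append(len(name))
        out += name
    out.append(0)  # a zero length marks the end of the path
    return bytes(out)


def decode_path(data):
    """Inverse of encode_path: return the list of component byte strings."""
    components = []
    i = 0
    while data[i] != 0:
        n = data[i]
        components.append(data[i + 1:i + 1 + n])
        i += 1 + n
    return components


encoded = encode_path([b"foo", b"bar", b"baz", b"myfile.txt"])
# Components containing '/' or a NUL byte survive a round trip unchanged.
odd_names = decode_path(encode_path([b"a/b", b"c\x00d"]))
```

Because each component carries its own length, neither '/' nor \0 needs to be reserved inside a name.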

 _Things are much easier for the file system if it can just treat names as
bags of bytes._

If you're really talking about _bags_ (unordered sets), that would certainly
make for an interesting filesystem since filename.txt, filemane.txt, and
maletent.fix would all be the same...

~~~
ridiculous_fish
This feels in some sense like punting the problem. What exactly should the
presentation layer do when presenting two files, where the first is named with
a precomposed character sequence, and the other has the same name but
decomposed? Surface the normalization form to the user? Uh, no...

The more fundamental question is whether filenames are under control of the
user or the system. The answer today is "both": there's blessed paths
/System/Library... and non-blessed paths like ~/Documents/Pokemon.txt.
Addressing this properly means reifying that distinction: making apps always
be explicit about whether they're working with the user's or the filesystem's
view of a file.

While we're here, why should the user even have to name every file? Do you
name every piece of junk mail on your desk/kitchen island?

The ideal for the user is something like tagging, where naming stuff is
optional, names are forgiving and perhaps not unique, and files are not
restricted to being in one directory. Meanwhile, filesystem names are an
implementation detail and the filesystem enjoys Unicode ignorance. Spotlight
moved a bit in this direction, but the end goal is still awfully far away.

~~~
comex
> What exactly should the presentation layer do when presenting two files,
> where the first is named with a precomposed character sequence, and the
> other has the same name but decomposed?

Show two files with apparently identical names. Is that so surprising? There
are many many ways for two different Unicode strings to look visually
identical (or near-identical) even if they aren't equivalent under
normalization. For that matter, even if you were limited to plain keyboard
ASCII, you could have two really long filenames that only differ in one
character, which are effectively indistinguishable without massive hair-
pulling. If you want to be robust, there's no way around having means to
identify files other than names.

(I agree with your sentiment about tagging.)
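
To make the precomposed/decomposed case concrete, here is a small Python sketch using the standard `unicodedata` module. The filename "café.txt" is just an example chosen for illustration.

```python
import unicodedata

precomposed = "caf\u00e9.txt"   # é as the single code point U+00E9
decomposed = "cafe\u0301.txt"   # e followed by combining acute accent U+0301

# The two names render identically but are different strings and different bytes.
same_string = precomposed == decomposed
same_bytes = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# They only compare equal after both are brought to the same normalization form.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
```

A filesystem that treats names as bytes sees two unrelated entries here; one that normalizes sees a single name.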

~~~
askvictor
Google drive allows you to have identically named files in the same directory.
Doesn't quite map onto a desktop filesystem, but the filename is not a unique
identifier, but just a convenience for the user or application.

------
QuercusMax
This seems especially bad because US-based developers who don't test with
unicode filenames might not come across this issue, leaving all their non-
English-speaking customers broken. (Not that this excuses such developers in
any way.)

It also means that, yet again, every app will need to be updated for a new
version of iOS. Makes me wonder how many apps will be left behind if not
updated? Thousands? Hundreds of thousands? Millions?

~~~
xenadu02
You should not be using anything other than UUIDs or integers for file names.
Maintain your own mapping in a database or file.

Using a network value or a value returned by an API is just asking for
trouble.

If a user names a file that will be hidden behind a URL the same advice
applies. If not then the user can use any sequence of bytes they want and you
shouldn't care.

~~~
astrange
It's poor UI to not name a file on the user's own machine, if they'll ever
have to look at it. I'm counting things like web browser caches in this
because OmniDiskSweeper/etc users will be seeing the large files you put in
there.

~~~
madeofpalk
Most importantly, APFS is first being rolled out to iOS where this isn't as
much of a problem because the user never really sees the filesystem anyway.

------
djrogers
iOS 10.3 with APFS has been in public and developer beta for several months -
it's up to beta 7 right now in fact. If this were as vast a problem as Michael
Tsai presents in this post, wouldn't we (the devs and beta testers) be running
into this a lot?

Given how loudly the tech press proclaims any perceived mis-step by Apple, I'd
have to believe we'd have been reading tons of 'Apple is Doooooooomed'
articles about this by now. Given that this hasn't happened, and I haven't
seen similar problem reports on dev forums and other hangouts, I'd lean
towards there being some miscommunication or misunderstanding here.

~~~
eridius
This is not likely to be a particularly big issue on iOS, because the file
system isn't directly exposed to the user (and therefore the user can't go
making changes behind the app's back). There are of course still edge cases
that could cause a problem, but they're going to be relatively rare.

But this may become a much bigger issue when we start using APFS on macOS.

~~~
mattcurtis
This isn't true of apps such as viewers/readers that use App File Sharing.

------
cesarb
This used to be a pain point with git, when some developers were using macOS
and the repository had file names with accents; to git, it looked like the
files had been renamed. Some time later, git added the
"core.precomposeunicode" option to work around this problem.

------
nailer
For anyone else wondering:

[https://en.m.wikipedia.org/wiki/Unicode_equivalence](https://en.m.wikipedia.org/wiki/Unicode_equivalence)

~~~
eric_h
Thank you, that's actually a very well written Wikipedia article. I'd not
heard the term "normalization" used wrt Unicode before.

------
jacobolus
This is a big change. I guess they now decided that compatibility with
external systems is a more important goal than end-user-friendliness.

It’s a reasonable decision to come to (especially for iOS where end-users
don’t ever really interact with the filesystem directly), but it will cause
quite a bit of churn in the short term.

------
vbezhenar
I'm not sure if normalization is a good idea (generally because Unicode is a
complex beast and moving that complexity inside a kernel should be carefully
weighed), but I'm sure that it doesn't solve any real problem. Characters "A"
and "А" look identical, unless you're missing a Cyrillic font, but they won't
be normalized, because they are completely different characters. There are
many other visually identical strings. So while normalization might solve
some simple problems, it's not a complete solution, so the filesystem might
just treat names as byte arrays and let the user solve his problems.
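
The point about Latin "A" and Cyrillic "А" can be checked with Python's standard `unicodedata` module. This is a small sketch; the helper name `distinct_under_all_forms` is made up for illustration.

```python
import unicodedata

latin_a = "\u0041"      # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"   # CYRILLIC CAPITAL LETTER A

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def distinct_under_all_forms(a, b):
    """True if a and b stay different under every Unicode normalization form."""
    return all(
        unicodedata.normalize(form, a) != unicodedata.normalize(form, b)
        for form in FORMS
    )


still_different = distinct_under_all_forms(latin_a, cyrillic_a)
```

So even a normalizing filesystem would happily store both names side by side.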

~~~
TazeTSchnitzel
Confusable characters look similar or the same to humans.

Canonically equivalent Unicode sequences look the same _to machines_.

The latter is a much more significant problem, because it can wreak havoc with
interoperability.

~~~
slrz
_Canonically equivalent Unicode sequences look the same to machines._

Memcmp disagrees, as do the default equality operators of most programming
languages in existence.

~~~
TazeTSchnitzel
Sure, but normalisation can nonetheless happen automatically and implicitly in
many places.

~~~
vbezhenar
Rust uses a separate string type for file names. I think that's a good
approach. If a language normalizes strings behind your back, that's not very
good.

------
Sami_Lehtinen
Unicode isn't required to mess up things. Here's what baffled me for a while
with NTFS. I'm pretty sure these issues are well known.

[http://www.sami-lehtinen.net/blog/linux-windows-ntfs-
differe...](http://www.sami-lehtinen.net/blog/linux-windows-ntfs-differences-
and-potential-problems)

~~~
std_throwaway
People should not use a non-compliant file system driver to create corrupted
entries. NTFS mounts are for windows machines only.

~~~
vetinari
What exactly makes a FS driver compliant or non-compliant? Is there an NTFS
compliance test suite that we can run against specific implementations?

If there isn't, accusations of non-compliance are just FUD.

------
lathiat
Seems to me that Apple would be smart to hook all of the file functions and
survey and/or alert on this situation somehow.
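
As a rough user-space illustration of such a survey (not how Apple would hook file functions), a Python sketch could group names by their normalized form and report any group with more than one distinct spelling. The function name `normalization_collisions` and the choice of NFC are assumptions made for this example.

```python
import unicodedata
from collections import defaultdict


def normalization_collisions(names, form="NFC"):
    """Return groups of distinct names that become identical after normalization."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize(form, name)].append(name)
    # Keep only groups that contain more than one distinct spelling.
    return [group for group in groups.values() if len(set(group)) > 1]


listing = ["caf\u00e9.txt", "cafe\u0301.txt", "readme.txt"]
collisions = normalization_collisions(listing)
```

Passing the result of `os.listdir(some_directory)` instead of the hard-coded list would survey a real folder.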

I only learnt about Unicode normalisation recently while looking at ZFS, which
has options for it that I had never seen until reading the Ubuntu Root FS on
ZFS guides, which talk about setting it.

------
kalleboo
Linus Torvalds will be happy to hear that
[http://www.cio.com/article/2868393/linus-torvalds-apples-
hfs...](http://www.cio.com/article/2868393/linus-torvalds-apples-hfs-is-
probably-the-worst-file-system-ever.html)

~~~
ben_bai
HFS+ can be configured at creation time to be case sensitive. I did so a while
back. Worked perfectly except for one application which could not find its
files. So I had to create a container, format it case-insensitive, and
install the app there...

~~~
cjensen
Adobe apps are notoriously incompatible with case sensitivity. If your apps
work, that's great. But if Apple switched to case sensitivity by default, it
would break apps.

~~~
slrz
Apple breaks apps all the time on macOS.

------
alphabettsy
Wouldn't this be seen as an issue in betas? I haven't seen anything indicating
this is widespread so far? Why would that be, just not wide enough deployment
yet?

~~~
Moru
Maybe most beta testers are based in English-speaking countries, or in
countries where most people are used to staying away from non-English
characters, and never noticed the problem? I live in Sweden and still avoid
using åäö in filenames because of old habits from the DOS/Atari era.

~~~
kzrdude
It's sad when we give in to technological limitations!

They are our tools, not the other way around ;-)

------
makecheck
Technically the presentation-layer problem existed already with things like
legacy path separators, making the Finder tell lies in the presence of colons
or slashes. I suspect that normalization differences will be a little like
telling two files apart when one has a trailing space, or hidden file
extensions; there will have to be _some_ distinction but maybe no easy answer.

------
al2o3cr

        More generally, once APFS is deployed users can legitimately end up with
        multiple files in the same folder whose names only differ in normalization.
    

The initial message that starts this off seems to imply the opposite -
instead, application developers should be normalizing the name before handing
it to the filesystem. In that case, an application which allowed non-
normalized naming would arguably have a bug.

~~~
Someone
Asking applications to enforce an OS-wide policy is asking for problems, even
ignoring the existence of malware.

You wouldn't say "the OS doesn't check file attributes to see whether you are
allowed to access a file; that's left to applications", either.

------
kalleboo
Don't Mac apps already have to deal with network and FAT32 drives? Or does
macOS already normalize those?

~~~
djrogers
Network drives are handled by the file sharing protocol, not the local file
system. FAT32 is handled by a FAT32 driver that does the correct thing
according to FAT32 rules.

~~~
kalleboo
And an APFS drive will work according to APFS rules, so I guess no worries

------
killercup
Not sure if APFS has such a thing, but I think I heard about it a while back:

Could they introduce a directory-level option to automatically normalize all
files below that node? (Same with case-sensitivity, which I think Adobe
software still has problems with.)

------
therealmarv
Is APFS still using Apple's style UTF-8 for e.g. Umlauts? I had a lot of
trouble with rsync and also Samba later (filenames and folders hidden) when I
discovered that Umlauts on HFS are different than Umlauts on e.g. Ext4.

~~~
yuhong
That is due to the HFS+ normalization that APFS eliminates. The filename is
stored in UTF-16 and OS X converts it to UTF-8.

~~~
frik
Is it stored as "real UTF-16" or in UCS-2 wide chars like in NTFS? The latter
would be a big problem.

~~~
yuhong
I believe HFS+ was created before Mac OS supported surrogates.
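
For anyone unfamiliar with the distinction: UCS-2 uses exactly one 16-bit unit per character and so cannot go beyond U+FFFF, while UTF-16 encodes higher code points as a surrogate pair. A small Python sketch (the helper name `utf16_code_units` is made up for illustration, and this says nothing about how HFS+ or NTFS actually store names):

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as little-endian UTF-16."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


# A character inside the Basic Multilingual Plane fits in one unit.
bmp_units = utf16_code_units("\u00e9")
# A character above U+FFFF needs two units: a high and a low surrogate.
astral_units = utf16_code_units("\U0001F600")
```

Software that assumes one unit per character (the UCS-2 view) would treat the second case as two separate, individually meaningless characters.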

------
dom0
It doesn't really matter what they do, since filesystem naming is FUBAR and
has been FUBAR pretty much since UNIXv1, and possibly even earlier than that
outside the UNIX family.

------
m-j-fox
I want to make the gzipped contents of my file its name and leave the actual
contents blank or whatever metadata. Thanks APFS!

------
rick_cheese
The way iOS abstracts the filesystem away from user view makes this less of an
issue than it otherwise would be, but it's still a good find by the author. As
an aside, surely I'm not the only one who thought of [1] when I read "APFS now
treats all files as a _bag of bytes_ on iOS" ;)

[1]
[https://www.youtube.com/watch?v=OT7xc_XqYO8](https://www.youtube.com/watch?v=OT7xc_XqYO8)

------
Prego
I've been very excited about ReFS -- a real "modern" filesystem that leaves
legacy issues behind. We've been using it for large storage systems, and I'm
hoping it will become a viable solution for everything soon. It solves most of
these issues.

~~~
asdfgadsfgasfdg
If Microsoft open sources it with a free patent license it might be useful.
Until then it is just another proprietary FS locked to a single OS.

~~~
Prego
Like APFS?

~~~
asdfgadsfgasfdg
Yep.

