APFS does not normalize Unicode filenames (mjtsai.com)



I agree that this is a good change. Unicode, normalisation, character encodings, etc. should really be handled at the presentation layer, and everything below that just treats filenames as sequences of bytes, perhaps with one or two exceptions like '/' and \0.

It is interesting to consider a theoretical system in which paths are represented in 0-terminated count-length format (e.g. "foo/bar/baz/myfile.txt" would be "\003foo\003bar\003baz\012myfile.txt\000"), truly allowing any byte in a filesystem node's name, although that might be going a little bit too far.
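Roughly, encoding into that format could look like this (a Python sketch, purely illustrative; no real filesystem uses this layout):

    # Hypothetical length-prefixed path encoding: each component is preceded
    # by a one-byte length, and the whole path ends with a NUL.
    def encode_path(components):
        out = bytearray()
        for name in components:
            if len(name) > 255:
                raise ValueError("component too long for a one-byte length prefix")
            out.append(len(name))
            out += name          # any byte is fine here, even b'/' or b'\0'
        out.append(0)            # terminating NUL
        return bytes(out)

    print(encode_path([b"foo", b"bar", b"baz", b"myfile.txt"]))
    # b'\x03foo\x03bar\x03baz\nmyfile.txt\x00'   (\n is the length byte 10)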

Things are much easier for the file system if it can just treat names as bags of bytes.

If you're really talking about bags (unordered sets), that would certainly make for an interesting filesystem since filename.txt, filemane.txt, and maletent.fix would all be the same...


This feels in some sense like punting the problem. What exactly should the presentation layer do when presenting two files, where the first is named with a precomposed character sequence, and the other has the same name but decomposed? Surface the normalization form to the user? Uh, no...

The more fundamental question is whether filenames are under control of the user or the system. The answer today is "both": there's blessed paths /System/Library... and non-blessed paths like ~/Documents/Pokemon.txt. Addressing this properly means reifying that distinction: making apps always be explicit about whether they're working with the user's or the filesystem's view of a file.

While we're here, why should the user even have to name every file? Do you name every piece of junk mail on your desk/kitchen island?

The ideal for the user is something like tagging, where naming stuff is optional, names are forgiving and perhaps not unique, and files are not restricted to being in one directory. Meanwhile, filesystem names are an implementation detail and the filesystem enjoys Unicode ignorance. Spotlight moved a bit in this direction, but the end goal is still awfully far away.


> What exactly should the presentation layer do when presenting two files, where the first is named with a precomposed character sequence, and the other has the same name but decomposed?

Show two files with apparently identical names. Is that so surprising? There are many many ways for two different Unicode strings to look visually identical (or near-identical) even if they aren't equivalent under normalization. For that matter, even if you were limited to plain keyboard ASCII, you could have two really long filenames that only differ in one character, which are effectively indistinguishable without massive hair-pulling. If you want to be robust, there's no way around having means to identify files other than names.

(I agree with your sentiment about tagging.)


Google Drive allows you to have identically named files in the same directory. It doesn't quite map onto a desktop filesystem, but there the filename is not a unique identifier, just a convenience for the user or application.


Even with normalization, you can just sprinkle an arbitrary selection of zero-width, non-breaking space characters into a string and create an infinite variety of identical-looking but distinct codepoint sequences. Treating filenames as bundles of bytes is fine, but at that point it seems dangerous to assume that interpreting them as encoded strings is a valid way of distinguishing them.
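For example (Python; U+200B is just one of many such invisible characters):

    import unicodedata

    a = "report.txt"
    b = "report\u200b.txt"       # same-looking name with a ZERO WIDTH SPACE inside

    print(a == b)                # False
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))
                                 # still False: normalization doesn't strip it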


You are talking about confusables. Unfortunately, normalization cannot handle this. Only certain apps treat confusables as confusables (email clients, web browsers, DNS services), usually via libidn.


> For that matter, even if you were limited to plain keyboard ASCII, you could have two really long filenames that only differ in one character, which are effectively indistinguishable without massive hair-pulling.

My initial response was to not present confusing data to the user, but then I thought of exactly this. In many fonts, lowercase "L" (l) and uppercase "i" (I) look the same. I'm sure in some fonts they actually are identical, rather than just being nearly indistinguishable to the human eye.

So the discussion is not about whether we should present confusing identifiers to users, but about the point at which, as this becomes more common, it causes enough of a problem to be a bad choice, and about what techniques there are to mitigate it.


> The more fundamental question is whether filenames are under control of the user or the system. The answer today is "both": there's blessed paths /System/Library... and non-blessed paths like ~/Documents/Pokemon.txt. Addressing this properly means reifying that distinction: making apps always be explicit about whether they're working with the user's or the filesystem's view of a file.

This distinction is important for other reasons, too. Should this file be indexed for search? Should it store an access time to support features like "recently opened"?

On the other side, we want to know which blessed paths "belong" to which system application or service so we can sync them, protect access to them, and uninstall them with their owners.

Maybe the user-visible document store should be a separate mechanism entirely from low-level app/service data persistence.


Microsoft tried to do something like your ideal with WinFS but couldn't make it work and abandoned the project.


I'd argue: use a language with generics and make the path type a ((de-)serializable) black box.

You still need to make a decision in the end, but are much more free to change it.


This is not a language issue.


I didn't say it was. I said generics would help one easily switch their choice on the different semantics, not avoid choosing at all.


Many, or most, languages already do that by representing filenames as native strings (which are usually abstract "Unicode" strings).


I want to swap the path type used by the FS implementation. This is totally distinct from a language providing some string type which, while abstract, is nevertheless fixed and unchangeable.


Both Objective-C and Swift have (a form of) generics on board.


It's a good idea until you end up with two files that have the "same" name (eg. Amélie.jpg and Amélie.jpg) because one uses decomposed characters (U+0065 and U+0301) and the other one uses a single character (U+00E9).

If the difference is not visible in your browser (it shouldn't be), try copy-pasting those two filenames into a text editor: one of them is 10 characters long and the other is 11.
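Or check it in a couple of lines of Python:

    import unicodedata

    nfc = "Am\u00e9lie.jpg"      # precomposed: é is U+00E9
    nfd = "Ame\u0301lie.jpg"     # decomposed: 'e' followed by U+0301 COMBINING ACUTE

    print(nfc == nfd)            # False: different code point sequences
    print(len(nfc), len(nfd))    # 10 11
    print(unicodedata.normalize("NFC", nfd) == nfc)   # True once normalized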


It can get worse - imagine a program that normalizes all paths before handing them off to the fs. Then you go to delete the denormalized file, but it deletes the normalized one. With this change, the only safe way to handle paths would be to NOT normalize at all, but Apple are recommending the exact opposite. I predict a lot of confusion will arise from this change.
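A minimal sketch of that failure mode (Python; assumes a non-normalizing filesystem such as APFS or ext4):

    import os, unicodedata

    os.mkdir("demo")
    open("demo/Am\u00e9lie.jpg", "w").close()      # precomposed
    open("demo/Ame\u0301lie.jpg", "w").close()     # decomposed: a distinct file here

    def delete(path):
        # hypothetical app that normalizes every path before touching the fs
        os.remove(unicodedata.normalize("NFC", path))

    delete("demo/Ame\u0301lie.jpg")   # user asked to delete the decomposed file...
    print(os.listdir("demo"))         # ...but the precomposed one was removed instead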


The problem exists already on the visual level, since there are pairs of distinct Unicode characters that look very much alike.

I think the only thing you can really do about that is restrict to ASCII minus control characters. Probably many programmers would be willing to accept that, but non-technical users would not.

The next best thing would be to enforce a canonical (code-point?) encoding. But given the complexities of Unicode that will get us only so far...


> The next best thing would be to enforce a canonical (code-point?) encoding.

That's exactly what HFS+ did, and what APFS doesn't.


> The problem exists already on the visual level, since there are pairs of distinct Unicode characters that look very much alike.

Yes, but software doesn't consider them equivalent.


The software also doesn't consider the two strings given as an example above equivalent, and for the same exact reason: they're sequences of different code points.



That's not about filenames.


>It's a good idea until you end up with two files that have the "same" name (eg. Amélie.jpg and Amélie.jpg) because one uses decomposed characters (U+0065 and U+0301) and the other one uses a single character (U+00E9).

It's still good then. And several systems allow for that just fine, including VMS (uniqueness comes from more than the filename). This is more like real-world folders (which can have identical items, e.g. two copies of the same paper), and it's also an excellent way to keep different document versions (keep the name the same, change the date shown).


Are you also against case-sensitive file systems? Otherwise you can end up with two files that have the "same" name (e.g. anne.jpg and Anne.jpg). Does normalization cover such a case?


I think the problem with Unicode normalization is that input methods are (well, conceptually, at least) meant to produce text, not binary. If filesystems use binary data for filenames, there can be cases where there is really no way to address a file by typing its name, even if you can type in that language. This isn't an issue for case sensitivity or the like.


>If filesystems use binary data for filenames, there can be cases where there is really no way to address a file by typing its name, even if you can type in that language

You could do that trivially in UNIX since forever.


Don't know about other UNIXes, but at least on GNU/Linux, neither IMEs nor filesystems are working with text data - it's all binary strings (with a few restrictions, like unacceptability of NULs). The only place where those binary strings are converted to text is when they're rendered (and this may cause some encoding-related oddities). So, sure one can do that.
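Python on Linux reflects that directly (a small illustration; the example names are made up):

    import os

    # The kernel sees names as byte strings (no NULs, no '/').  CPython mirrors
    # that: pass bytes, get bytes; pass str, get str, with undecodable bytes
    # smuggled through via the 'surrogateescape' error handler (PEP 383).
    print(os.listdir(b"."))   # e.g. [b'caf\xc3\xa9.txt', b'\xff\xfe.bin']
    print(os.listdir("."))    # e.g. ['café.txt', '\udcff\udcfe.bin']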

I would've actually preferred for identifiers to be Unicode text strings.


Filesystems themselves treating file names and paths as bits of binary data is the right way to do it; normalization shouldn't be handled directly by the filesystem, as it adds complexity and duplication of effort.

Ideally, the VFS layer should be handling all this garbage.


Programs often allow you to type arbitrary binary values using a keyboard, either by holding a keyboard control key and typing the numeric value, or prefixing it with '\'.


For ASCII, I would gladly do this. There may be edge cases outside of the English language that I'm not aware of.

I am somewhat willing to accept the value of case sensitivity in identifiers for programming languages (though it's often related to the verbosity of the language, as in Java: "Foo foo = new Foo()"), but not in file naming.

Of course, I realize that the ship has sailed, and we're stuck with case sensitivity in file systems.


No good has ever come from allowing BEL and DEL as part of filenames.

I'm with David Wheeler: https://www.dwheeler.com/essays/fixing-unix-linux-filenames....

We need to limit filenames for the good of the entire system and the whole community. Filenames as byte strings may sound good, but nobody ever thinks of the costs and the scant benefits.


I disagree completely. What we need is to allow even slashes in file names – not to restrict (for what an end user may feel are arbitrary reasons) even more writeable characters from file names, just so we can avoid fixing or replacing our tools or writing one more line in a shell script.


Any solution you come up with must work with replacing existing software piecemeal, or it's no solution.


> Things are much easier for the file system if it can just treat names as bags of bytes.

And much, much harder for applications if that "bag of bytes" is an invalid UTF-8 sequence. You will end up with an invalid string (or an exception), and trying to open that file will then fail.

I'd really hope that Apple checks that the filenames are valid UTF-8 as they otherwise can end up with very interesting security bugs.
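For instance, in Python the naive decode just throws (a small illustration; surrogateescape is the usual workaround, but every application has to know about it):

    name = b"caf\xe9.txt"                 # Latin-1 'café.txt', not valid UTF-8

    try:
        name.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)                          # can't decode byte 0xe9 ...

    # lossless round-trip via surrogateescape, at the cost of odd-looking strings
    s = name.decode("utf-8", "surrogateescape")
    assert s.encode("utf-8", "surrogateescape") == name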


> And much, much harder for applications if that "bag of bytes" is an invalid UTF-8 sequence.

It makes copying-and-pasting filenames as Unicode unreliable, for one thing.


You can handle that fine provided you don't assume that.

> I'd really hope that Apple checks that the filenames are valid UTF-8 as they otherwise can end up with very interesting security bugs.

I think that's a bit pessimistic, as Linux has run for years like that without any real issues.


At least valid utf-8 isn't difficult for a filesystem to enforce.


I'm not so sure: the thing is, those strings are not just a bag of bytes. They have semantics. Is it legal to include bytes that are 0? How about slashes in file names? How about byte sequences that aren't utf8 at all?

These things aren't just byte streams, they have semantics.

So one risk of treating this as "just bytes" is that bugs will introduce byte sequences that aren't utf8 at all, which will cause other programs to fail, or worse, to "try" and thereby corrupt the data further.

Another risk is that since it's supposed to be utf-8, some programs may do canonicalization internally to avoid confusing situations. This may even happen accidentally (though it's not likely): after all, a unicode-processing system could be forgiven for transparently changing canonicalization. But if a program canonicalizes you can now get really weird behavior such as opening a file, then saving it, and ending up with two files that look identical - because the write wrote to a path that was canonicalized.

Additionally, even though you can never avoid confusing paths entirely without considering the glyphs rendered, you are losing a very simple check against a fairly large class of errors.


No. Ever heard about http://websec.github.io/unicode-security-guide/

Identifiers should be identifiable. If a filename is encoded in utf8, it needs to be normalized. To the canonical form of course not the crazy Python 3 or Apple idea of NFD. Which is also slower. If it's encoded as bytes you get garbage in - garbage out.

There's much more to consider, but unfortunately you cannot restrict a directory to forbid mixed scripts or confusables. I summarized a few problems at http://perl11.org/blog/unicode-identifiers.html


(Offtopic) You may want to correct that blog post, the point about Japanese writing systems. "[...] Japanese using characters from Chinese (Kanji/Han), Katagana (modern japanase) and Hiregana (the old middle-age script used by women)" is really incorrect. Suggest to just say that modern Japanese uses both logographic (kanji, originated from Chinese hanzi[1]) and syllabic (kana) characters simultaneously, with two distinct types of kana (hiragana and katakana).

[1] In Unicode, they're generally unified into a single set, via process called "Han unification". So, unlike Greek "Α" vs Cyrillic "А", the "same" character that may even look slightly differently in Chinese vs Japanese (e.g. "海"), would have a single codepoint in Unicode. But that's another story...


Ugh… yeah, agreed; reading that was painful.

To reiterate:

• Chinese characters are called hànzì¹

• modern Japanese uses many hànzì, calling them kanji² instead

• katakana³ is used in modern Japanese to write words of foreign origin (loanwords) as well as onomatopoeia and is romanized as katakana, not katagana

• hiragana⁴ is used in modern Japanese to write okurigana⁵, particles, and certain words, and is romanized as hiragana, not hiregana

Here’s a typical (and extremely simple) sentence in modern Japanese that uses all three:

この文はサンプルです。

This is a sample sentence.

この: this (hiragana)

文: sentence (kanji)

は: particle (pronounced wa) (hiragana)

サンプル: sample (loanword; pronounced sanpuru) (katakana)

です: particle (pronounced desu) (hiragana)

――――――

¹ — https://en.wikipedia.org/wiki/Chinese_characters

² — https://en.wikipedia.org/wiki/Kanji

³ — https://en.wikipedia.org/wiki/Katakana

⁴ — https://en.wikipedia.org/wiki/Hiragana

⁵ — https://en.wikipedia.org/wiki/Okurigana


> です: particle (pronounced desu) (hiragana)

(While we're being pedantic)

です isn't a particle, it's the imperfective (present/future) polite form of the copular verb (to be, in English). The only particle in that sentence is は.

And as a further side note, some words or names normally written in kanji or hiragana are occasionally written in katakana for emphasis or to emphasize a foreign nature.


Thanks, appreciated. I only have two "Learn Japanese" and "Learn Korean in x days" books for dummies.

Fixed


> No. Ever heard about http://websec.github.io/unicode-security-guide/

Which part of it is relevant to pathnames?

> To the canonical form of course

Do you mean NFC?

> not the crazy Python 3 or Apple idea of NFD.

In what circumstances does Python 3 normalize pathnames?

> If it's encoded as bytes you get garbage in - garbage out.

What do you mean by "encoded as bytes"?

> http://perl11.org/blog/unicode-identifiers.html

This is about programming language identifiers, not about pathnames, which are very different beasts.


> Which part of it is relevant to pathnames?

All parts regarding identifiers and names, if you see pathnames as publicly identifiable information. See below.

> In what circumstances does Python 3 normalize pathnames?

None. Python 3 normalizes identifiers to the shorter non-canonical NFKC form. Apple HFS+ normalizes pathnames to the longer canonical NFD form, which is faster but takes more space. Usually space is more important than CPU.

> What do you mean by "encoded as bytes"?

What everyone else means on byte encodings: 1-1 mapping of bytes, without any further knowledge of the encoding.

> This is about programming language identifiers, not about pathnames, which are very different beasts.

Partially. Pathnames can also be argued to be identifiers, similar to domain names, email names, user names, and language identifiers. Apple does so. Ask Plan 9 what they think about pathname semantics. Not everyone is in the garbage in, garbage out camp. Many systems do encode pathnames and have character restrictions. Especially important is the popular / spoof, which people from the garbage camp don't care about. Julia is a recent proponent of the garbage camp, btw.


> to the shorter non-canonical NFKC form

What definition or sense of 'canonical' do you mean? NFKC stands for "Compatibility Decomposition, followed by Canonical Composition", so it's canonical in some sense, right?


There can only be one canonical normalization, not two. NFC is the canonical one, NFKC compresses ligatures differently.
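The difference is easy to see on a compatibility ligature (Python):

    import unicodedata

    s = "\ufb01le.txt"                         # starts with the 'ﬁ' ligature U+FB01
    print(unicodedata.normalize("NFC", s))     # 'ﬁle.txt'  (canonical: unchanged)
    print(unicodedata.normalize("NFKC", s))    # 'file.txt' (compatibility: ligature expanded)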


> unfortunately you cannot restrict a directory to forbid mixed scripts or confusables

That's fortunate, not unfortunate. I want the files corresponding to documents to be named according to the titles of those documents, and said titles often do mix Latin and Cyrillic scripts in a way that produces "confusables". So as a generic rule for all filenames, it's a no-go - it will significantly affect perfectly legitimate use cases.


For the average user, "documents" and "Documents" are the same thing. Are paths also normalized to upper or lower case? Or should the file system be case-insensitive?


Case insensitivity is something best addressed in userspace because:

1) History has shown that doing it at the filesystem level causes significant compatibility issues

2) Case insensitivity can present visually confusing results

3) Round-tripping between cases sometimes is not possible without loss of information

4) Some languages don't have a meaningful definition of 'case insensitive'

5) Some languages would demand a more comprehensive approach (i.e. being able to search using either kana or kanji for Japanese filenames)

So for things like case insensitivity (or #5 above) it's best to handle them at the application (or, ideally, OS user space library) level, taking advantage of information like system locale. Low-level APIs and system services can use case-sensitive filenames.

This is different from the question of whether to normalize filenames, because filenames are still visually unambiguous (at least most of the time) in UI, logs, and debuggers.

Incidentally, Windows is proof of this technique's advantages because NTFS is a case-sensitive file system while Win32 is case-insensitive. IIRC the Linux subsystem for Windows takes advantage of this.
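As a sketch of what "handle it in userspace" can look like (Python; casefold() is only an approximation of full locale-aware folding):

    import os

    def find_case_insensitive(directory, wanted):
        # The filesystem stays case-sensitive; the lookup layer folds case.
        # casefold() covers more than lower() (e.g. German 'ß' -> 'ss'), but
        # locale rules like Turkish dotless i still need special handling.
        folded = wanted.casefold()
        return [n for n in os.listdir(directory) if n.casefold() == folded]

    print(find_case_insensitive(".", "readme.md"))   # matches README.md, ReadMe.MD, ...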


> If you're really talking about bags (unordered sets), that would certainly make for an interesting filesystem since filename.txt, filemane.txt, and maletent.fix would all be the same...

I was thinking the same thing! Although if we're being pedantic, I think "bag" is an "unordered multiset". If it were just a plain unordered set, then "filename.txt" would also be the same as "filenam.txt".


Incidentally, your theoretical system is exactly how the DNS stores domain names.


>perhaps with one or two exceptions like '/' and \0

Aren't colons (:) an issue with MacOS as well? I think Finder and other userspace apps convert them to slashes. I suppose though, there's a case for the filesystem not caring about that.


This sounds like DJB's Netstring [1].

"hello world!" encoded as "12:hello world!,"

An empty string as "0:,"

[1] https://en.wikipedia.org/wiki/Netstring


How do you normalize Arabic or Chinese in a meaningful way for people speaking those languages?


You don't; there is only one Unicode representation of those characters.


Interestingly that created its own annoyances for some since normalization occurs at a different stage.

Unicode and UTF-8 unify many Traditional & Simplified Chinese Characters as well as Japanese Kanji and Korean Hanja.

There were some extremely angry reactions to Han Character Unification[1], though efforts have been made to overcome many of the initial complaints.

[1] "Why Unicode Won’t Work on the Internet: Linguistic, Political, and Technical Limitations" http://www.hastingsresearch.com/net/04-unicode-limitations.s...


Not the case for Arabic. There are hundreds of ligature code points, for example ﰻ (kl), ﷵ (ṣl'm) and ﷳ (akbr). And of course ﷽


I don't believe that's literally true; some legal and correct Japanese Unicode sequences will be transformed under at least some of the Unicode normalization forms, no?


So what does the Mac do with those?


Use that representation? The problem only happens when there is more than one way to represent the same text.


Ok


If I were allowed one more distinction I would definitely add utf-8 only.


The same goes for case sensitivity, which, unfortunately, they see as a defect. It would be great for portability if APFS would stay case-sensitive like the Unix file systems, i.e. just a dumb layer that reads/writes bytes.


I think you mean case-aware, not case-sensitive. Just bytes like Unix leads to case-sensitive behavior.


This seems especially bad because US-based developers who don't test with unicode filenames might not come across this issue, leaving all their non-English-speaking customers broken. (Not that this excuses such developers in any way.)

It also means that, yet again, every app will need to be updated for a new version of iOS. Makes me wonder how many apps will be left behind if not updated? Thousands? Hundreds of thousands? Millions?


I'm going through apps on my phone and can hardly think what any of them would use Unicode filenames for. Say, a messenger might use a user's nickname to name a history file — that would cause one-time loss of history, but not break the app. What else?

Something tells me practically no apps will be seriously affected.


Any app that stores files in iCloud Drive has a fully user-accessible chunk of filesystem; it has to give the files meaningful names and properly handle files being renamed, added, etc. from the iCloud Drive app or file picker.

So they have to deal with arbitrary filenames; on the other hand, for the same reason, they can't maintain a master list of files somewhere which would break, but have to check the actual directory contents instead. Still, things like history or links between files might be broken.

That is, unless NSFileCoordination APIs act differently wrt normalization; iCloud clients have to use those rather than accessing the filesystem directly.


> they can't maintain a master list of files somewhere which would break, but have to check the actual directory contents instead.

They should be doing that anyway. Anything else is just begging to get out of sync at some point.


You should not be using anything other than UUIDs or integers for file names. Maintain your own mapping in a database or file.

Using a network value or a value returned by an API is just asking for trouble.

If a user names a file that will be hidden behind a URL the same advice applies. If not then the user can use any sequence of bytes they want and you shouldn't care.
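A minimal sketch of that pattern (Python; the store/index layout here is made up for illustration):

    import json, os, uuid

    STORE = "store"                            # app-private data directory
    INDEX = os.path.join(STORE, "index.json")  # our own id -> display name mapping

    def save_document(display_name, data):
        # Store the bytes under a UUID; the user-visible name only lives in the
        # index, so normalization and confusables never matter on the filesystem.
        os.makedirs(STORE, exist_ok=True)
        file_id = uuid.uuid4().hex
        with open(os.path.join(STORE, file_id), "wb") as f:
            f.write(data)
        index = {}
        if os.path.exists(INDEX):
            with open(INDEX) as f:
                index = json.load(f)
        index[file_id] = display_name
        with open(INDEX, "w") as f:
            json.dump(index, f)
        return file_id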


It's poor UI to not name a file on the user's own machine, if they'll ever have to look at it. I'm counting things like web browser caches in this because OmniDiskSweeper/etc users will be seeing the large files you put in there.


Most importantly, APFS is first being rolled out to iOS where this isn't as much of a problem because the user never really sees the filesystem anyway.


It doesn't matter much on iOS, where users rarely if at all have to access the filesystem directly. It is very rare that the same files would be accessed and modified by more than a single application anyway.


Assuming your application is the only app that will ever interact with a file is only correct in very specific circumstances. This is terrible advice in the general case.


Except for filesystem experts like Dropbox, developers probably shouldn't be letting users name their files.


I'm not sure I understand. How can an app NOT allow me to name my files - unless you're talking about some iOS-like "hide the filesystem" silo.


iOS 10.3 with APFS has been in public and developer beta for several months - it's up to beta 7 right now, in fact. If this were as vast a problem as Michael Tsai presents in this post, wouldn't we (the devs and beta testers) be running into this a lot?

Given how loudly the tech press proclaims any perceived mis-step by Apple, I'd have to believe we'd have been reading tons of 'Apple is Doooooooomed' articles about this by now. Given that this hasn't happened, and I haven't seen similar problem reports on dev forums and other hangouts, I'd lean towards there being some miscommunication or misunderstanding here.


This is not likely to be a particularly big issue on iOS, because the file system isn't directly exposed to the user (and therefore the user can't go making changes behind the app's back). There are of course still edge cases that could cause a problem, but they're going to be relatively rare.

But this may become a much bigger issue when we start using APFS on macOS.


This isn't true of apps such as viewers/readers that use App File Sharing.


Is it possible that the beta testers are largely in the US and so wouldn't have seen this issue much?


As I understand it, it may only be a problem when using the straight libc calls. But most devs just use the file system stuff that's in Apple's Foundation framework.


This used to be a pain point with git, when some developers were using MacOS and the repository had file names with accents; to git, it looked like the files had been renamed. Some time later, git added the "core.precomposeunicode" option to work around this problem.



Thank you, that's actually a very well written Wikipedia article. I'd not heard the term "normalization" used wrt Unicode before.


This is a big change. I guess they now decided that compatibility with external systems is a more important goal than end-user-friendliness.

It’s a reasonable decision to come to (especially for iOS where end-users don’t ever really interact with the filesystem directly), but it will cause quite a bit of churn in the short term.


I'm not sure normalization is a good idea (mainly because Unicode is a complex beast and moving that complexity into a kernel should be carefully weighed), but I'm sure it doesn't solve any real problem. The characters "A" and "А" look identical (unless you're missing a Cyrillic font), but they won't be normalized, because they are completely different characters. There are many other visually identical strings. So while normalization might solve some simple problems, it's not a complete solution, and the filesystem might as well just treat names as byte arrays and let the user solve their own problems.
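Indeed, no normalization form conflates them (Python):

    import unicodedata

    latin, cyrillic = "A", "\u0410"      # LATIN CAPITAL A vs CYRILLIC CAPITAL A
    print(latin == cyrillic)             # False
    print(unicodedata.normalize("NFC", latin) == unicodedata.normalize("NFC", cyrillic))
                                         # still False: different characters, not
                                         # normalization variants of each other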


Confusable characters look similar or the same to humans.

Canonically equivalent Unicode sequences look the same to machines.

The latter is a much more significant problem, because it can wreak havoc with interoperability.


Very true - and it's amplified by inconsistency allowing problems to spread further before being noticed. At work I deal with a lot of Bag-It archives where we have a text manifest of checksums accompanying files on disk and this reliably bites users of tools when something (archive or network transfer tool, Git or SVN, transition to/from a Mac with HFS+, etc.) causes the encoding in the manifest not to match the local filesystem, and the confusion is amplified because some tools will handle normalization differences so the bug report is “why does tool A say this file is missing when Explorer/Finder and tool B say it's fine?”


> Canonically equivalent Unicode sequences look the same to machines.

Memcmp disagrees, as do the default equality operators of most programming languages in existence.


Sure, but normalisation can nonetheless happen automatically and implicitly in many places.


Rust uses separate string type for file names. I think, that's a good approach. If language normalizes strings behind your back, that's not very good.


Unicode isn't required to mess up things. Here's what baffled me for a while with NTFS. I'm pretty sure these issues are well known.

http://www.sami-lehtinen.net/blog/linux-windows-ntfs-differe...


People should not use a non-compliant file system driver to create corrupted entries. NTFS mounts are for windows machines only.


What exactly makes a FS driver compliant or non-compliant? Is there a NTFS compliance test suite, that we can run against specific implementations?

If there isn't, accusations of non-compliance are just FUD.


This is not a bug in either, it's a configuration defect. NTFS-3g used with the correct options, namely windows_names, works exactly as expected, but is then not POSIX compliant anymore.


Seems to me that Apple would be smart to hook all of the file functions and survey and/or alert on this situation somehow.

I only just learnt about this Unicode normalisation recently, looking at ZFS, which has options for it that I had never seen until reading the Ubuntu "Root FS on ZFS" guides, which talk about setting it.


Linus Torvalds will be happy to hear that http://www.cio.com/article/2868393/linus-torvalds-apples-hfs...


HFS+ can be configured at creation time to be case-sensitive. I did so a while back. It worked perfectly except for one application which could not find its files. So I had to create a container, format it case-insensitive, and install the app there...


I used to format my whole disk as case-sensitive, but enough poorly written apps broke (OK, mostly Steam) that I now just have a case-sensitive partition for my source code and /usr/local.


Adobe apps are notoriously incompatible with case sensitivity. If your apps work, that's great. But if Apple switched to case sensitivity by default, it would break apps.


Apple breaks apps all the time on macOS.


This is for iOS, where the app developer fully controls file naming within their sandbox. It is very unlikely that MacOS will fail to normalize because filenames there are presented directly to the user.


Wouldn't this be seen as an issue in betas? I haven't seen anything indicating this is widespread so far. Why would that be, just not wide enough deployment yet?


Maybe most beta testers are based in English-speaking countries, or countries where most people are used to staying away from non-English characters, and never noticed the problem? I live in Sweden and still avoid using åäö in filenames because of old habits from the DOS/Atari era.


It's sad when we give up to technological limitations!

They are our tools, not the other way around ;-)


Technically the presentation-layer problem existed already with things like legacy path separators, making the Finder tell lies in the presence of colons or slashes. I suspect that normalization differences will be a little like telling two files apart when one has a trailing space, or hidden file extensions; there will have to be some distinction but maybe no easy answer.


    More generally, once APFS is deployed users can legitimately end up with
    multiple files in the same folder whose names only differ in normalization.
The initial message that starts this off seems to imply the opposite - instead, application developers should be normalizing the name before handing it to the filesystem. In that case, an application which allowed non-normalized naming would arguably have a bug.
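Under that reading, "normalize before handing it to the filesystem" is just a thin wrapper around the file APIs (a sketch; the helper name is made up):

    import unicodedata

    def open_normalized(path, *args, **kwargs):
        # An app that always normalizes to NFC never *creates* names that differ
        # only in normalization form (it can still encounter ones made by others).
        return open(unicodedata.normalize("NFC", path), *args, **kwargs)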


Asking applications to enforce an OS-wide policy is asking for problems, even ignoring the existence of malware.

You wouldn't say "the OS doesn't check file attributes to see whether you are allowed to access a file; that's left to applications", either.


Don't Mac apps already have to deal with network and FAT32 drives? Or does macOS already normalize those?


Network drives are handled by the file sharing protocol, not the local file system. Fat32 is handled by a fat32 driver that does the correct thing according to fat32 rules.


And an APFS drive will work according to APFS rules, so I guess no worries.


Not sure if APFS has such a thing, but I think I heard about it a while back:

Could they introduce a directory-level option to automatically normalize all files below that node? (Same with case-sensitivity, which I think Adobe software still has problems with.)


Is APFS still using Apple-style UTF-8 for e.g. Umlauts? I had a lot of trouble with rsync and also Samba later (filenames and folders hidden) when I discovered that Umlauts on HFS are different from Umlauts on e.g. Ext4.


What do you mean by "different"?

Umlauts on HFS+ are still Umlauts anywhere else. The only real oddity of HFS+ (beyond the fact that it does normalization at all) is that it's not using NFD, it's using a variant of NFD based on an old version of Unicode (it has to be this way because the normalization tables must be immutable or there's compatibility issues when reading drives written to from different versions of the OS). So if you take a filename and convert it to NFD, it may not be the exact same byte sequence that you get if you plug that filename into HFS+ (but in most cases it will be). But whatever byte sequence you get from HFS+ is still going to be a valid Unicode sequence.


I don't know the technical details but the UTF-8 from a Mac is different from UTF-8 on Linux.

Copy "hällö" from Mac to Linux over samba -> works fine.

scp or rsync that same file again from Mac to Linux (without special conversion options) -> you will have two "hällö" there.

The first one is recognized by Samba, the second one is invisible.


No, UTF-8 is the same everywhere. What you're confused about is the fact that for a lot of strings there are multiple different unicode scalar value sequences that represent the same user-visible string. This is the difference between composed and decomposed characters. For example, é can be represented either as a single scalar value that represents e-with-acute, or as two scalar values, a regular e followed by a combining acute mark.

On HFS+, because it normalizes, if you try to pass either form to the filesystem, it will treat them the same. But on filesystems that don't normalize (which is most of them) they'll be treated as distinct files. As a result, depending on the input you provide and whether the tools in question do any normalization, you could end up with two files that look identical (e.g. your two "hällö" files) but are different unicode sequences under the hood.

And none of this really has anything to do with UTF-8. UTF-8 is just a byte encoding scheme that can represent all valid unicode sequences.


Well, one of these (the Apple one) is clearly wrong, because Samba (which is not niche software) does not work with it, and this is critical for me. Just saying I'm not the only one who is confused ;)


It's not wrong. There are multiple ways to represent a lot of filenames (in particular, names with accents, umlauts, or graves). It's very plausible that the input methods on different systems are producing different sequences (composed vs decomposed). If you're storing a file with a composed sequence on your server, then copy it to your Mac's drive, then copy it back, you may end up with a duplicate file because it would have ended up as a decomposed sequence on your local drive.

But this doesn't make it wrong. You can reproduce this exact behavior with other systems too simply by changing the local filename to be decomposed. The only Mac-specific part here is that HFS+ will automatically decompose the filename.


I suppose you could store the normalization table in the filesystem superblock, if you wanted the ability to update in future versions but still have those readable on older versions of the OS.


If you are rsync'ing from HFS on a Mac to a Linux server, you can use "--iconv=UTF-8-MAC,UTF-8" to fix this problem.


Yes, I used that too :) Also some similar options on sshfs. But I'm glad I discovered this all before transferring critical data :)


Do you mean composed versus decomposed? UTF-8 allows both. Unfortunately this article doesn't explain it very well; here's a better link.

https://docs.syncthing.net/advanced/folder-autonormalize.htm...


That is due to the HFS+ normalization that APFS eliminates. The filename is stored in UTF-16 and OS X converts it to UTF-8.


Normalization and UTF-16 are completely orthogonal. It doesn't matter in the slightest whether the filesystem stores the filename in UTF-16 or UTF-8, either way it's still a sequence of Unicode scalar values, and that's what matters.


HFS does a lot more than convert to UTF-8. E.g. it will normalize ö to o+(umlaut), using custom normalization rules that nobody else uses.

If you tell HFS to store a file with an Ö in it, and then list the directory, the character Ö is nowhere to be found.


Is it stored as "real UTF-16" or in UCS-2 wide chars like in NTFS? The latter would be a big problem.


I believe HFS+ was created before Mac OS supported surrogates.


It doesn't really matter what they do, since filesystem naming is FUBAR and has been FUBAR pretty much since UNIXv1, and possibly even earlier than that outside the UNIX family.


I want to make the gzipped contents of my file its name and leave the actual contents blank or whatever metadata. Thanks, APFS!


The way iOS abstracts the filesystem away from the user's view makes this less of an issue than it otherwise would be, but it's still a good find by the author. As an aside, surely I'm not the only one who thought of [1] when I read "APFS now treats all files as a bag of bytes on iOS" ;)

[1] https://www.youtube.com/watch?v=OT7xc_XqYO8


I've been very excited about ReFS -- a real "modern" filesystem that leaves legacy issues behind. We've been using it for large storage systems, and I'm hoping it will become a viable solution for everything soon. It solves most of these issues.


If Microsoft open-sources it with a free patent license, it might be useful. Until then it is just another proprietary FS locked to a single OS.


Like APFS?


Yep.


Except cross-platform support, right?


A "modern" filesystem that lacks support for hard links? No thanks.



