Hacker News
Linus Torvalds on HFS+ (plus.google.com)
177 points by kannonboy on Jan 12, 2015 | 119 comments

While I really like thristian's explanation of why case insensitivity adds massive complexity to a system (https://news.ycombinator.com/item?id=8876873), I have to point out that case sensitivity offloads a bunch of that complexity to the user. This is almost definitely why OS X uses case-insensitive HFS+ by default.

As an ultra-simple example: with a case-sensitive file system I can have two directories, "documents" and "Documents". Which one are my documents in? Half in one, half in the other, probably.

I'm not saying Linus et al. are wrong that case-sensitive is the way to go, but there are some reasonable arguments for trying to take that complexity back from the user and deal with it in the file system.

Source: I'm a long-time Mac user, and I just switched to a case-sensitive file system last year. The win of having my dev machine match my deployment environment (iOS) is bigger than any downsides I've seen yet.

I find case-insensitivity (as both Mac OS X and Windows do it) user-hostile. So, there's a counter-anecdote.

But, I believe all that proves is that in many cases, "what I'm used to" is the same as "easy and intuitive". There is often no one true way to make something complex easy and intuitive, and sometimes the "easy way" is the way you're used to (but for someone used to another way, it won't be easy).

In this case, in the absence of compelling evidence that case-insensitivity is a net win for the user, the simpler implementation is probably the right one. The number of bugs caused over the years by insensitivity, Unicode mapping and such is probably further evidence favoring simplicity.

Also, the average Mac user doesn't even know they have a command line (seriously, I can't remember ever saying to a casual Mac user, "Open Terminal" and having them say anything other than, "I don't know what that is" or "I don't have that"). There's no reason a higher layer can't make case-insensitive decisions for the user (say, searching for "the clash" will find music by "The Clash"), which doesn't require the freaking filesystem to make those kinds of guesses. Having this happen at the filesystem layer has always seemed utterly mad, to me. And, having it happen at the higher layers is how Linux and UNIX software has always handled it. Somehow we muddle through with the filesystem being entirely naive about the letter "a" being sort of the same as "A" (for humans).

How is it user hostile?

I think it is very reasonable to say that common users of the alphabet expect "Documents" and "documents" to be the same word. I think it is even safe to assume most readers consider "john" and "John" the same name, while "Jon" and "John" are not.

I can't think of any situation where I have needed to have two differently capitalized files reside in the same folder, but I can remember dozens of times I've been annoyed by it.

My point is, there are arguments for calling case-sensitivity user-hostile, but you don't present any to counter them.

Would you expect "install" and "Install" to be the same name? In most languages you'd be right to, but if your FS were using the Turkish locale, you'd be wrong[0]. So now we need to stamp a "locale" on the file system to avoid problems -- but then how do you carry that over when you're e.g. unpacking a zip you received from a Turkish person and your own file system has a different idea of what characters are the same or not? (I suppose I should say "glyphs", but even that's quite iffy in unicode terms.)
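To make the Turkish case concrete, here is a small Python sketch. Python's `str.upper()`, `str.lower()` and `str.casefold()` use Unicode's default, locale-independent mappings, so they show roughly what a filesystem would do if it ignored the Turkish rules. The helper name `same_name_default` is made up for illustration.

```python
# Default (locale-independent) case mapping, as Python's str methods do it.
# Under Turkish rules "i" uppercases to U+0130 and "I" lowercases to U+0131;
# the default mappings used below know nothing about that.

def same_name_default(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names using default case folding."""
    return a.casefold() == b.casefold()

DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# An English speaker's expectation holds:
print(same_name_default("install", "Install"))                      # True
# A Turkish speaker's capitalised form is treated as a different name:
print(same_name_default("install", DOTTED_CAPITAL_I + "nstall"))    # False
# Dotless i uppercases to plain ASCII "I", colliding with dotted i's uppercase:
print(DOTLESS_SMALL_I.upper() == "I")                               # True
# Lowercasing the dotted capital yields two code points (i + combining dot):
print(len(DOTTED_CAPITAL_I.lower()))                                # 2
```

Whichever rule set the filesystem bakes in, one of the two audiences gets names that compare "wrong".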

I'm not sure that having such a fundamental thing as whether two file names are the same be dependent on such complex things as unicode, canonical forms, etc. is a good idea.

I won't comment on whether it's more user-friendly or hostile -- I have no data to back that up anyway. It's ridiculously complex and that's what I would base my decision on. At least using "a string of bytes" as the file name is simple conceptually and pretty hard to get wrong.

[0] en.wikipedia.org/wiki/Dotted_and_dotless_I

Obviously, the solution is to just assign a numerical index to every file, and require filenames to be contiguous integers. Problem solved.

In reality, there is a sliding scale between usefulness and complexity in filenames, and where one person thinks the optimal trade-off point is may not match what others think. Many filesystems have chosen (rightly so in my opinion) to just allow as much as possible, and let convention and the application layer sort it out. There are many, many complexities to character normalization and casing (think unicode and languages where multiple characters may be combined into a single separate character, or where there are choices to make on the correct way to change case based on the word itself), and again the problem is where to draw the line at what the filesystem should do for you.

The real problem is that people think that filesystems let us name files with words, when really we are just tagging files with characters. Words, as used in language, are an entirely more complex beast, and IMHO, entirely the wrong thing for a filesystem to care about.

Most visibly: Tab completion can get the wrong/unexpected file (note that it doesn't have to be the same filename to cause this problem...if I'm expecting "my<tab>" to complete to "my program", because it is the only file starting with "my" but instead it completes to "My Documents" or requires a double tab and more specificity from me, that's an annoyance). Windows tab completion is perhaps the perfect storm of annoyance for me, with regard to case insensitivity. It'll complete to the first thing it sees case insensitively and then require cycling through all matches to find the desired file. This would also break file globs, or at least my expectation of what file globs should do.
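A rough Python sketch of the glob point, assuming a toy directory listing. `fnmatchcase` is the standard library's case-sensitive matcher; the two helper functions are hypothetical and only model the two behaviours, not any real shell.

```python
from fnmatch import fnmatchcase

# Toy directory contents (made up for the example).
names = ["my program", "My Documents", "Music"]

def glob_sensitive(pattern, names):
    """Match the pattern exactly as typed."""
    return [n for n in names if fnmatchcase(n, pattern)]

def glob_insensitive(pattern, names):
    """Model a case-insensitive match by lowercasing both sides (ASCII-only thinking)."""
    return [n for n in names if fnmatchcase(n.lower(), pattern.lower())]

print(glob_sensitive("my*", names))     # ['my program']
print(glob_insensitive("my*", names))   # ['my program', 'My Documents']
```

With the sensitive match, "my" is already unambiguous; with the insensitive one it needs more typing or cycling through candidates.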

This is, of course, a mild bit of user hostility, but it is less friendly and harder to use, to me, than case sensitivity. As I noted above, it's mostly a matter of what you're used to, but in the absence of a reasonable case that case insensitivity makes life easier for users, the right implementation is the simpler one.

The more dangerous stuff is where the OS decides that one file is actually a different file, through character conversion or through Unicode mapping. And, that, (not) coincidentally, is where serious security bugs were found in git. And, git is not alone in running into these problems.

In short, I find it annoying that the OS would decide it knows what I want, when I ask for something else. But, more importantly, it has been proven occasionally to be dangerous, and often to be error-prone, in implementation. But, on that front, I'm probably just going to repeat what Linus said less effectively.

It doesn't do that on OS X.

As an ultra-simple example: with a case-sensitive file system I can have two directories, "documents" and "Documents". Which one are my documents in? Half in one, half in the other, probably.

This is a source of confusion, but filesystem convention is not the place to solve it because it can't solve it. I can still have three directories for bug #12: "ticket_12", "ticket12" and "ticket 12". Which one is my investigation in? Is documentation in "doc" or "docs"? etc...

Spaces in filenames are terrible as well. Is it "Documents" or "Documents "?

Perhaps one could calculate some sort of visual distance between filenames and not allow any two in the same folder, if their distance were less than some constant.

You could tokenise them, and then compare the resulting list of tokens.
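One way the tokenising idea might look, as a hedged Python sketch. The tokenisation rule (lowercase runs of ASCII letters and runs of digits, everything else dropped) and the names `tokens` and `too_similar` are assumptions chosen for the example, not anything a real filesystem does.

```python
import re

def tokens(name: str) -> list:
    """Split a name into lowercase letter runs and digit runs; drop separators."""
    return re.findall(r"[a-z]+|[0-9]+", name.lower())

def too_similar(a: str, b: str) -> bool:
    """Hypothetical rule: two names clash if their token lists are identical."""
    return tokens(a) == tokens(b)

print(tokens("ticket_12"))                      # ['ticket', '12']
print(too_similar("ticket_12", "ticket 12"))    # True
print(too_similar("ticket_12", "ticket12"))     # True
print(too_similar("Documents", "Documents "))   # True
print(too_similar("doc", "docs"))               # False
```

Note that it still lets "doc" and "docs" coexist, which is the earlier point: any fixed rule catches some confusable pairs and misses others.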

Funnily enough, I've had the same thoughts about variable names. We expect case-sensitive variable names because that's what we're used to, but they're not obviously the right choice. I've worked on reasonably sized lisp projects with case-insensitive names with no detriment.

Most of the time, having both a `foo` and a `Foo` identifier at the same time is a bad idea: it's hard to remember which is which. The only time it makes sense is when it's backed up by strong conventions: say, `foo` is a method and `Foo` is a class. With these conventions, you're not really remembering `foo` and `Foo`, you're remembering "`foo` method" and "`foo` class". In a sense, it's just a hack to have sigils use an otherwise unused part of the identifier, just like tagged pointers use the unused bits of a pointer for metadata.

Using capitalization for sigils like this is fine, and works well in practice, but you could as easily use some other symbols in the same role. At the very least, it helps to have the conventions built right into your language (like Haskell) rather than just followed manually (like Java). Moreover, having significant capitalization outside the conventions (i.e. in the rest of the word) will still just be confusing: was it `subString` or `substring`?

I don't know what the best solution is, but case sensitivity everywhere isn't it.

I've always thought case-sensitive variable names to be an odd thing. You allow your program to handle it and then tell the user to never ever use it.

> As an ultra-simple example: with a case-sensitive file system I can have two directories, "documents" and "Documents". Which one are my documents in? Half in one, half in the other, probably.

That only really happens if you're an idiot who just haphazardly throws documents into any directory that looks like it can still hold files. And it's not an indictment of case sensitivity; you can achieve the same sort of stupidity in plenty of other ways if you're determined to do so.

Idiot here. I like to save downloaded files in a directory called "downloads". Firefox has decided it would prefer to save them in a directory called "Downloads". I can tell Firefox to save them in the downloads directory, but sooner or later, it inevitably decides it's tired of that and goes back to saving them in Downloads.

That's just a problem with the application (and it just happens for that problem to be hidden by a case-insensitive file-system). For a solution, if you can't bend your application to your will, bend yourself to it and rename your "downloads" to whatever Firefox likes.

I download my files into a directory called 'shit'. Maybe I need a filesystem that conflates 'shit' with 'Documents'.

Someone else has already commented to defend the idea that techy-people can get caught up by this, but I'd like to also point out that a massive portion of the market for computing devices is made up by people who are, by some measures, idiots. Just like how a large portion of the market for automobiles is made up of "idiots" with respect to engine rebuilding.

Resolve idiots' problems in the UI, not in the file system.

Case insensitivity is a tiny part of the problem in my view. On a case insensitive filesystem you can still have distinct directories called Document, Documents, My Documents, Documnts, etc. and really, from the user's point of view, why can't you have two directories called Documents? The entire concept of addressing things by name is counterintuitive. Mapping strings that differ only by case into the same name is a teeny band-aid on a gaping wound.

But from the user's point of view, if you have two directories called Documents, which one are you looking at? Or are you saying that the file system should merge them into one directory? Are you saying that the user should know which directory is which by something else, say position in a list (which could be re-ordered by user action)? By some icon associated with the list?

What's your actual solution?

I don't have a specific solution, but whatever it is, it will be considerably different from what we as programmers understand a filesystem to be. It has to be something new. There's no point trying to put little band aids on filesystems to make them friendlier like this. In contexts where the user might have trouble with concepts like C != c, they shouldn't be exposed to raw filenames at all.

Well, in the real world the same user meets a lot of similar dilemmas, like acquainting himself with people with the same name or encountering cars of the same model and color (and what happens then is that one starts to pay more attention to distinctions). Why does it have to be any simpler for data labels? Does that poor user have to be necessarily guarded to such a degree from the reality itself?

"This is almost definitely why OS X uses case-insensitive HFS+ by default."

See my other comment, whereby the Darwin team told me they use case-insensitive FS because Microsoft Office internally converts filenames to uppercase or lowercase at its leisure, often several, if not tens of times during an operation.

I could be wrong, the person / people I talked to could be incorrect, but I've never heard another explanation that was not speculation.

Your reasoning all applies reasonably well to why Microsoft decided to buck the trend and go case insensitive, since case sensitive was the norm up until they came along afaik.

Adobe products have had issues with HFS+ Case Sensitive FS for YEARS. It's part of why I moved away from them (along with not releasing patches for software 1-2 versions old on versions of OS X 1-2 versions newer).

Here's a sample:




From personal experience, Adobe applications have never worked on case-sensitive HFSX. Also with the legacy compatibility layers OS X shipped with up until 10.6, there were enough expectations on case-insensitivity that it was probably easier for them to just not risk a bunch of legacy apps not working anymore. And by the time 10.7 comes along there's enough momentum to just leave it alone. Their focus now is on Core Storage (an LVM like thing that doesn't have 1/2 the features of LVM), to the degree that now in 10.10 it's the default for both upgrades and new installations. So FWIW these volumes are no longer read/writable elsewhere like on linux or Windows.

Several years ago, when I'd first moved to SF, I got a scholarship to WWDC for working for ACM, and I took advantage of the opportunity to go to a Birds-of-Feather for people interested in "Darwin filesystems" or something like that. It was basically a little conference room with the FS team, and I asked, flat-out, "Why case insensitive?"

And they answered, flat-out: "Microsoft Office"

Even this past week, I was talking with coworkers about having trouble with nothing but Valve Steam on my Macs that have a case-sensitive filesystem[0]. That's particularly odd, since it works on Linux now, but that's another matter.

What I found most notable about this thread, is this quote from Linus:

"And Apple let these monkeys work on their filesystem? Seriously?"

I'm pretty sure Apple actually _fired_ anyone who wanted any of the things done anything close to any way that Linus Torvalds would agree with.

ext* not being particularly perfect, I'm happy to have both. I mean, ext2 is hard to complain about, but it comes from an era where basically all filesystems were terrible, literally the era when SGI started installing backup batteries to race with fsync().

ext4 has an alarming number of corruption bugs, but I'm sure it's not because of insane unicode handling, though I take Linus' description of how the OSX filesystem works with a grain of salt. He can't possibly _care_ to know as much about it as he knows about Linux's.

[0] achievable by formatting HFS+X in Disk Utility in Recovery Mode, then installing onto that drive

> ext4 has an alarming number of corruption bugs

Which bugs are these?

> though I take Linus' description of how the OSX filesystem works with a grain of salt. He can't possibly _care_ to know as much about it as he knows about Linux's.


"Which bugs are these?"

My favorite is, at least a couple years back, if you had a KVM guest with an ext4 filesystem in an image file, on an ext4 filesystem, the guest OS could corrupt the host OS.


Obviously, ext4 is largely a very good FS, but at least twenty people I've managed hundreds of servers with agree that if you do not need huge filesystems, huge files, or directories with tons of files, ext3 is a safer choice.

I'm not saying Linus is incompetent, just that his criticism of other filesystems shouldn't be processed as if his work has no faults.

One thing I wonder is, is it the filesystem, or the C library, that determines things like how unicode is interpreted in paths - his overwhelming rant focus.

> at least a couple years back, if you had a KVM guest with an ext4 filesystem in an image file, on an ext4 filesystem, the guest OS could corrupt the host OS.

Your "alarming number of corruption bugs" is one bug from a couple of years back?

ext4 is fairly well established now. A few years ago it might have been new enough that there were edge cases that needed investigation, but it's robust enough now.

And there's a difference between a decent design having some bugs that can be fixed and a fundamentally broken design. Linus seems to be arguing that HFS+ is the latter.

The problem with HFS+ is precisely that it needs to handle unicode in the filesystem because it needs to consider different names as the same. And, seemingly, it does it even worse than NTFS does.

Why wouldn't Microsoft Office work properly if the file system is case-sensitive?

Just a guess, but multiple hard-coded references to the same file with inconsistent case?

Can someone explain to me why case insensitivity is a bad thing? Clearly Linus believes so but didn't explain why he believes so.

Most UNIX and Linux systems seem to have an "all lowercase" or "all uppercase" convention, so the fact that they have case sensitivity is often not utilised.

In fact the biggest reason you'd want case sensitivity off the top of my head is legacy support but that's just a circular argument (since you never really reach WHY it was that way originally, just that it was).

I guess based on what he talks about next he is worried about how case insensitivity interacts with other character sets (i.e. does it correctly change their case), but for most sets aren't the lower and upper case defined in the Unicode spec itself?

The prime number-one concern in kernel programming is managing complexity. Well, in most programming really, but in kernel programming unmanaged complexity leads to lost data and sometimes broken hardware instead of "just" crashes.

Case-sensitivity is the easiest thing - you take a bytestring from userspace, you search for it exactly in the filesystem. Difficult to get wrong.

Case-insensitivity for ASCII is slightly more complex - thanks to the clever people who designed ASCII, you can convert lower-case to upper-case by clearing a single bit. You don't want to always clear that bit, or else you'd get weirdness like "`" being the lowercase form of "@", so there's a couple of corner-cases to check.
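The single-bit trick and its corner case can be checked in a few lines of Python. This is just a sketch of the idea described above; the function names are made up.

```python
CASE_BIT = 0x20  # the one bit that separates 'a' (0x61) from 'A' (0x41)

def ascii_upper_naive(ch: str) -> str:
    """Clear the case bit unconditionally (wrong for non-letters)."""
    return chr(ord(ch) & ~CASE_BIT)

def ascii_upper(ch: str) -> str:
    """Clear the case bit only for 'a'..'z'; leave everything else alone."""
    return chr(ord(ch) & ~CASE_BIT) if "a" <= ch <= "z" else ch

print(ascii_upper_naive("a"))   # A
print(ascii_upper_naive("`"))   # @   (0x60 with bit 0x20 cleared is 0x40)
print(ascii_upper("`"))         # `
print(ascii_upper("z"))         # Z
```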

Case-insensitivity for Unicode is a giant mud-ball by comparison. There's no simple bit flip to apply, just a 66KB table of mappings[1] you have to hard-code. And that's not all! Changing the case of a Unicode string can change its length (ß -> SS), sometimes lower -> upper -> lower is not a round-trip conversion (ß -> SS -> ss), and some case-folding rules depend on locale (in Turkish, the uppercase of LATIN SMALL LETTER I is LATIN CAPITAL LETTER I WITH DOT ABOVE, not LATIN CAPITAL LETTER I like it is in ASCII). Oh, and since Unicode requires that LATIN SMALL LETTER E + COMBINING ACUTE ACCENT should be treated the same way as LATIN SMALL LETTER E WITH ACUTE, you need to bring in the Unicode normalisation tables too. And keep them up-to-date with each new release of Unicode.
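The length change and the failed round trip are easy to see from Python, whose string methods apply Unicode's default full case mappings. A minimal illustration, using "straße" as an arbitrary example word:

```python
s = "straße"
upper = s.upper()

print(upper)                  # STRASSE
print(len(s), len(upper))     # 6 7   (uppercasing changed the length)
print(upper.lower())          # strasse
print(upper.lower() == s)     # False (lower -> upper -> lower is not a round trip)

# Case folding is the operation meant for comparison; it maps both to "strasse".
print(s.casefold() == "STRASSE".casefold())   # True
```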

So the last thing a kernel developer wants is Unicode support in a filesystem.

[1]: http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

And even if you do things The Unicode Way, you'll get a constant stream of bug reports from people who don't like The Unicode Way, and want things done Their Way, which is, of course, The Right Way.

There are as many Right Ways as there are people, of course, and no two are completely compatible, so if you want to Do The Right Thing, either go see a movie or write your application software to do it like that, because the OS kernel can only do it one way. And, of course, if you want applications to be able to do case folding, the filesystem needs to be case-sensitive so all the relevant information is preserved.

"And keep them up-to-date with each new release of Unicode"

I agree with the rest, but that is one thing you absolutely shouldn't do. Once there are disks 'out there' that were created with some idea about what case insensitivity is, your choice has been set in stone. The risk of changing any rule is simply too high. Somebody might start reading that disk using the previous version of the file system.

For example, the precursor to HFS+, HFS, like HFS+, kept directories sorted by name. However, it had a filename sorting bug that the Finder had to work around (see http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.ht...)

> I agree with the rest, but that is one thing you absolutely shouldn't do.

And now you have a file system that is case sensitive, BUT ONLY IN SOME UNICODE RANGES. Somehow, that's better than just not being case sensitive in the first place?

Mac OS X's NFD conversion was the problem for me. The case insensitivity was, superficially anyway, not the problem - you get back the filename you initially wrote (it just could overlap with another file that is not clearly defined because world languages are crazy). But NFD changes the filenames you wrote into a different byte sequence.

I've used all-Linux systems for a while, and recently added a MacBook Pro Retina. I keep some personal directories of stuff backed up on multiple systems, along with a "manifest" of files and md5sums.

NFD conversion meant that file names, which included correctly-encoded utf-8 (NFC ...) accented characters, changed between writing them, and "ls" of the directory.

If I didn't already have experience with utf-16 and all sorts of unicode encoding and translation silliness, debugging this would have been impossible. The fix was to add this to my scripts:

| perl -M'Unicode::Normalize' -M'open qw(:std :utf8)' -ne 'print NFC($_);' |
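For readers who don't speak Perl, here is a rough Python sketch of the same mismatch and the same fix. The filename is made up, and `unicodedata.normalize("NFD", ...)` only stands in for what a normalizing filesystem hands back (HFS+ uses its own variant of decomposition, so treat this as an approximation).

```python
import unicodedata

written = "caf\u00e9.txt"                        # NFC: "é" is a single code point
listed = unicodedata.normalize("NFD", written)   # stand-in for what a later listing returns

print(written == listed)                         # False: same-looking name, different string
print(len(written.encode("utf-8")), len(listed.encode("utf-8")))   # 9 10

def nfc(name: str) -> str:
    """Normalize a name to NFC before comparing it with a manifest entry."""
    return unicodedata.normalize("NFC", name)

print(nfc(listed) == written)                    # True
```

Normalizing both sides to one form before comparing is the whole trick; which form you pick matters less than applying it consistently.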

man oh man was that hard to figure out though. Mac OS X is way way harder to make work right than linux. (See also disk utility)

I get the problem, but don't you need canonical equivalence to make sure you don't allow files with filenames that look the same but which happen to have a different unicode byte sequence?

Or is the problem specifically with NFC? I don't see how byte equality could be preserved with the other normalisations.

Well in my case, I don't have to worry about filename spoofing, or even filenames being too similar, for my own directory of records/archives.

The strangeness of a normalizing filesystem, is that you can try to open a filename (with accent, NFC), and it works. Then, a readdir() doesn't list that exact filename byte sequence (you happen to get one that's extremely similar, but NFD).

On Linux it always does: there's no normalization, and exactly what you open is exactly what's listed. This really is an awfully nice simplification for this layer of abstraction. But you could argue that I'm just accustomed to the tradeoffs of it by this point.

With or without Unicode's own meddling, characters may still look the same. Think of non-Latin alphabets that have letters defined in an identical fashion to those in Latin alphabet. Heck, even without other alphabets, the Latin letter O and the digit 0 are valid characters for filenames in any system I know of, and although looking the same they are yet different. (Having a bar, a point, or some other distinctive mark inside 0 is just a hack, the common description for the aspect of this digit makes no distinction from letter O.) This problem is not new, definitely not brought up by this Unicode thing.

True, we have seen this issue with fraudulent domain names, where they can pose a security issue.

To me it seems normalisation at least reduces the attack surface, while not solving the issue completely. I am not qualified to judge whether or not the complexity increase is worth it. But what are the downsides - what is so bad about NFC?

Well put.

However, I'd add that what sucks even more is that this is not just a kernel issue. Every single application needs to worry about this.

In this particular instance, the vulnerability is caused by git not handling aliased names correctly.

That's a really good and comprehensive answer, thanks.

I second this. I was trying to understand the reasons and your answer helped me. It saved me a few hours. Thank you!

Speaking from imperfect knowledge, I'd guess: case insensitivity implies allowing for aliasing (case aliasing for case insensitive and god-knows-what for Unicode insensitivity.)

Which means that anywhere you handle names, you explicitly handle these aliases. Miss a spot (or an alias), and you have a security bug.

In this case, the specific exploit Linus is referring to works by committing a malicious .git file as .Git to your repo. Then somebody else on OSX clones your branch, and the bad .Git file will overwrite your .git file, causing a security breach.

Yes, it would be nice to have universal "uppercase" and "lowercase" rules, but in the real world, collation rules are crazy, arbitrary, and damn near impossible to get right. Now try to get it right in EVERY locale around the world, because if you don't, you open a big gaping security hole or data corruption bug.

Indeed, "uppercase/lowercase" is a pretty anglocentric way to phrase things. Who knows what sort of cases somebody might come up with in their next language. "Normalization" is probably a more general term, but who says a normalization always exists?

I don't know to what extent this affects HFS+, but case insensitivity in general is locale dependent. The classic example is the Turkish distinction between dotted and dotless I - http://en.m.wikipedia.org/wiki/Dotted_and_dotless_I.


The dotless I, I ı, denotes the close back unrounded vowel sound (/ɯ/). Neither the upper nor the lower case version has a dot.

The dotted I, İ i, denotes the close front unrounded vowel sound (/i/). Both the upper and lower case versions have a dot.

End quote.

I can imagine that being a nightmare, particularly if one user uses a Turkish locale and another on the same machine uses English. And all of that complexity in the kernel? Ouch. (Or HFS+ just behaves incorrectly for Turkish file names, not sure that's really better?)

Upper/lower case is locale language sensitive. One specific example, the uppercase for "i" varies depending on whether you have a Turkish locale or not. String.toLowerCase() in Java has exceptions coded in for the "tr", "az", and "lt" locale languages. Some of these languages may cause the length of the string to change when upcased or downcased, too.

Additionally, given a utf8 string, OSX will translate it into another, possibly different utf8 string before using it as a filename (the NFD normalization that Linus mentioned).

The exploit being referenced allows malicious code to be injected into a file that handles git, which is (.git). Now, normally Git does not allow you to overwrite (.git). Since HFS+ is case-insensitive, writing the file as (.Git) will overwrite the (.git) file and break security. Linus is saying that an uppercase and lowercase character are two different things, and not recognizing that can (and did) cause problems with security.

His argument is wrong. The issue is not with the filesystem being case-insensitive; the issue is that some strings which are not identical map to the same filename.

Two strings differing in case is only one such example. Another example is the presence of codepoints which are ignored by the filesystem. The fix for git and Mercurial included the fix for this as well. Does Linus think all filesystems should allow all 256 byte sequences in every permutation as well? (He might, since that is the behavior of common Linux filesystems.)

The proper mindset is to realize the actual issue (multiple strings map to the same filename) and use a "filename_according_to_program -> filename_according_to_filesystem" function everywhere when dealing with filenames, not to blame the filesystem.
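A hedged Python sketch of what such a mapping function could look like. Everything here is an assumption for illustration: `fs_key` is a hypothetical stand-in for "filename_according_to_filesystem", the ignorable set is a tiny illustrative sample rather than the real HFS+ table, and this is not how git or Mercurial actually implement their checks.

```python
import unicodedata

# Illustrative only: a few zero-width / format code points treated as ignorable.
# The real filesystem table is larger and is not reproduced here.
IGNORABLE = {"\u200c", "\u200d", "\ufeff"}

def fs_key(name: str) -> str:
    """Hypothetical mapping from a program's name to the filesystem's idea of it."""
    stripped = "".join(ch for ch in name if ch not in IGNORABLE)
    return unicodedata.normalize("NFD", stripped).casefold()

def aliases_dotgit(component: str) -> bool:
    """True if this path component would land on the same entry as '.git'."""
    return fs_key(component) == fs_key(".git")

for candidate in [".git", ".Git", ".g\u200cit", "git"]:
    print(repr(candidate), aliases_dotgit(candidate))
# '.git' True
# '.Git' True
# '.g\u200cit' True
# 'git' False
```

The catch raised in the replies still applies: the program has to carry a correct `fs_key` for every filesystem (and possibly locale) it may run on.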

You hit the nail on the head: he does expect things to work almost exactly as they do on Linux. The thing about Linus, if you've ever spent any time reading the Linux kernel mailing list, is that he is extremely explosive and at times has really harsh analogies. But... he is usually right.

The question you need the filesystem to answer in this case is, "If I save a file with name A, will it be accessible via name B?" You can't assume that the filesystem has one canonical name from which the file will be accessed.

EDIT: That actually isn't even enough, because the answer may vary by locale.

Even worse, to solve this for Git's case, you need to know the answer

1) For every locale

2) For every supported filesystem

I've run into issues with it from a programming point of view, you have to make sure to do a (slightly) more expensive case insensitive comparison when dealing with filenames.

This is only an issue when it is fake case insensitive. In the sense that it claims it is case insensitive but in fact is case sensitive (and returns case sensitive filenames). NTFS is a perfect example of this, it claims to be case insensitive but in fact stores everything in a case sensitive way (so you wind up doing things like ToUpper() on every filename).

On a legitimately case insensitive filesystem both "etc" and "ETC" would return identical filenames as they would be stored identically in the file system's database (e.g. would return "etc" for both).

So what you're really giving is not an example of why case insensitivity is bad, but why half measures are bad.

Pro-tip: never do .ToUpper() / .ToLower(); use foo.Equals(bar, StringComparison.OrdinalIgnoreCase). It saves a string allocation and also only loops over the strings once. For the truly anal.
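The distinction being drawn here (compare case-insensitively, but hand back the name as it was stored) can be modelled with a toy directory in Python. The `Directory` class is entirely hypothetical and uses `str.casefold()` as its comparison key; it says nothing about how NTFS or HFS+ are really implemented.

```python
class Directory:
    """Toy case-insensitive, case-preserving directory (illustration only)."""

    def __init__(self):
        self._entries = {}   # folded name -> name exactly as first stored

    def create(self, name: str) -> str:
        """Store the name unless an alias already exists; return the stored spelling."""
        return self._entries.setdefault(name.casefold(), name)

    def lookup(self, name: str):
        """Return the stored spelling for any alias, or None if absent."""
        return self._entries.get(name.casefold())

d = Directory()
d.create("Etc")
print(d.lookup("etc"), d.lookup("ETC"))   # Etc Etc
print(d.create("ETC"))                    # Etc  (no second entry is created)
print(d.lookup("usr"))                    # None
```

A caller that compares the returned spelling byte-for-byte against what it asked for will still be surprised, which is the "half measure" complaint above.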

>NTFS is a perfect example of this, it claims to be case insensitive

Where does it claim this? In any case this is not true because...

>but in fact stores everything in a case sensitive way

This does not have to be true. NTFS can be case-sensitive or case-insensitive depending on which subsystem is accessing it. For example, while the POSIX subsystem for Windows was a thing, installing it had the side-effect of turning NTFS case-sensitive.

Also a longer critique by John Siracusa here: http://arstechnica.com/apple/2011/07/mac-os-x-10-7/12/#hfs-p...

I wonder how this ties in with the whole Apple philosophy of "Design is how it works."

Clearly the innards look nothing like the facade.

> File system metadata structures in HFS+ have global locks. Only one process can update the file system at a time.

Holy heck! How does that work in practice? Do operations get queued and then a single kernel process who can take the lock do the atomic updates?

I have to imagine that is going to cause a bottleneck however, as all non-read operations need to update the metadata (e.g. timestamp, maybe size if it is stored).

That all being said I haven't noticed OS X being particularly slower to do things than e.g. Windows. So if that is the case they're hiding it well.

HFS+ is an oddball filesystem among FSes that anyone actually uses. Since the earliest days of the Mac, it's been able to e.g. track file identity even as a file is moved around, and the metadata to support this lives in a volume-wide tree called the Catalog File. So, there's your global lock. (Add an SSD as necessary for better performance.)

OS X already has some pretty high-level file and metadata APIs not found on other systems, so maybe Apple's future plans don't look like a traditional Unix file system at all. They've already demonstrated they know how to make a very weird, non-Unix filesystem look like one. ;)

Wonder why the "HFS+ Private Data" hack was chosen when Rhapsody existed even back in 1997.

Scroll down for Linus' commentary.

> "The true horrors of HFS+ are not in how it's not a great

> filesystem, but in how it's actively designed to be a bad

> filesystem by people who thought they had good ideas."

There doesn't seem to be a way to deep link to comments in G+?

G+ is a train wreck for a number of reasons, this included.

Yeah, he's not a fan of HFS+ at all. Wasn't the plan to move to ZFS prior to the Oracle acquisition of Sun? Hopefully that ends up back on track somehow.

ZFS on OS X was abandoned several years ago. Unless Apple has a new filesystem under development as a skunkworks project, HFS+ is the filesystem on OS X for the foreseeable future.

And even if they did build a new FS, there are some assumptions like case insensitivity that likely couldn't be abandoned without potentially breaking backwards compatibility.

And in effect ZFS is forked. There's openzfs and Oracle ZFS, so if Apple were to go with ZFS, which one would they pick? The smart choice would be openzfs which almost certainly means they end up licensing Oracle's ZFS at which point it's like, classic Apple - let's try whenever possible to make interoperability either difficult or hilarious.

Apple's Core Storage does have some COW properties, so they could probably do something like Linux LVM thin-provisioning-style snapshots. They could maybe add a tree for checksums, which would be a good idea for the journal at least, but since checksums haven't been a priority even for the journal, the most unencumbered and recent add-on, they are evidently not a priority at all. The COW stuff was necessary to automagically migrate data from the SSD to the HDD in their so-called "fusion drive", so maybe they could leverage that for an updated version of Time Machine, and also for atomic online OS/software updates with an optional later reboot rather than a mandatory one.

One rather cool thing it does is on-the-fly encryption and decryption of an LV. You can even reboot in the middle of either conversion and it resumes where it left off. And those encrypted volumes can be resized (that's not well documented but is supported and is even used by their own Boot Camp Assistant).

So about the limit of the new fs is an LVM like thing under journaled-HFS+/X.

Only if AAPL acquires/merges with ORCL.

3rd-party: there's the non-FUSE port https://openzfsonosx.org/

The project predated Oracle, back in the OS X 10.5-10.6 days when there was the DTrace integration as well. There were internal builds of the ZFS kext, and at one point I saw leaked source on the Internet. The project was abandoned because of legal disagreements over CDDL licensing, as I recall.

CDDL licensing? More likely NetApp scared Apple off with patent aggression.


It was in the dropdown list in Disk Utility in the Leopard developer betas. I never actually tried using it.

Isn't there a problem with ZFS' memory footprint to be usable for desktop systems?

ZFS works a lot better if you dedicate lots of memory to it, yes.

Apple sells far more iOS devices than Macs these days, and the iOS filesystem is case-sensitive HFS+. It's hard to imagine Apple doing something new in filesystems that doesn't improve the platform they make the most money from.

In a way, the same logic applies to Macs. The number of Macs that support multiple internal disks will soon be zero if it isn't already, so some of the best features of ZFS are irrelevant on new Apple computers.

With Thunderbolt ports on every Mac, why should storage be limited to internal devices?

Because they want you on iCloud. You can't even install the OS without an Apple ID and an active internet connection anymore.

This is extremely false.

Which part? The iCloud part or the OS install part?

I just finished reinstalling Yosemite an hour ago after wiping the HD of a friend's new laptop. It would not allow me to get past the eligibility check until I logged in with a valid Apple ID.

Your friend may have had Find my Mac turned on. At that point the only way to get an OS installed on it is to have the original Apple ID it was turned on under.

Nope. She never used it. It was the initial "determining the eligibility of this machine" phase where it asked me to log in (originally, this was to check that you'd indeed purchased OSX Lion, but the OS is "free" now). If "Find my Mac" is turned on, it won't even let you wipe the hard drive.

Sounds like you were using Internet Recovery. It's downloading the OS installer from the App Store - if you had install media it'd work fine.

It's a MacBook Air. There is no install media.

You can make one if you choose (most people use a USB stick, I'm using an ExpressCard SSD) - this will remove the need for an AppleID and internet access to do the install.

For the vast majority of people this will never be an issue as the machine comes pre-installed and re-installing the OS is generally not required unless something major goes wrong.

Apart from being false this doesn't make sense. Apple put thunderbolt ports on the machines they sell. You're suggesting they included ports so customers wouldn't use them?

It seems there was a licensing problem using ZFS on OSX: http://arstechnica.com/apple/2009/10/apple-abandons-zfs-on-m...

I'm rather amused (and, as always, a bit saddened) by the comment of Terry A. Davis, of TempleOS fame/notoriety.

"linux doesn't have to search parent directories for file-not-found, but I do."


Edit: further parsing reveals he implemented a (read-only) overlay system in his FS. Interesting, I wonder what the side-effects (vulns) could be?

The word "vulnerability" doesn't make sense when talking about TempleOS, since it does not even attempt to offer any kind of security.

"Quite frankly, HFS+ is probably the worst filesystem ever."

Can anyone summarize why he thinks this?

HFS+ has been patched with duct tape and pieces of cardboard for 20 years; receiving journaling, support for Unix attributes, extended attributes, 64 bits sizing, multi-processing, multi-users, hard links, etc. over the years. Really it's a sort of monument to kludge. It should have been ditched like 10 years ago.

HFS+ is basically the Apple equivalent of an alternate universe where Microsoft just keeps gluing new features on to FAT forever and never creates NTFS.

I'm not sure that's terribly different from ext4's lineage, which is generally accepted (AFAIK) as a pretty good file system, even if not cutting-edge.

You might've missed Linus' second comment on the page, wherein he talks about case-insensitivity and NFD normalization.

Because he hasn't used MFS. ;-)

HFS+ is probably the worst filesystem in common use right now; even FAT has the benefit of simplicity. Most of its issues, however, are with its horrific implementation; the Unicode naming is kind of bad but Linus manages to be wrong about several things.

Regarding case sensitivity: it is generally accepted among the user interface crowd that (Western) users don't really understand that 'C' and 'c' are different things; they're "both" 'c'. Case-preserving is thus the accepted practice. However, case manipulation is not an operation that can be done absent a locale; my go-to example here is that 'i' upcases to 'I' unless you're a Turk, in which case it upcases to 'İ'. Similar, although not quite as bad, is the fact that 'ß' upcases to 'ẞ' U+1E9E in some exotic circumstances; see http://en.wikipedia.org/wiki/Capital_ẞ for details. Similar limitations apply to sorting, which users also expect.
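
To make the locale point concrete, here is a small Python sketch (my own illustration, not something any filesystem does). Python's str methods apply Unicode's default, locale-independent case mappings, so they show what you get when no locale is available. The helper name same_ignoring_case is hypothetical.

```python
# Python's str.upper()/lower()/casefold() use Unicode's default case
# mappings and never consult the user's locale.

def same_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names using default case folding."""
    return a.casefold() == b.casefold()

# With no Turkish locale in play, dotted 'i' and dotless U+0131 both upcase to 'I'.
dotted_upper = "i".upper()
dotless_upper = "\u0131".upper()

# Lowercasing U+0130 (capital I with dot) gives 'i' plus U+0307 COMBINING DOT ABOVE.
lowered_capital_dotted_i = "\u0130".lower()

# The default uppercase of U+00DF (sharp s) is 'SS', not U+1E9E.
sharp_s_upper = "\u00df".upper()

print(dotted_upper == dotless_upper)
print(len(lowered_capital_dotted_i))
print(sharp_s_upper)
print(same_ignoring_case("STRASSE", "stra\u00dfe"))
```

The takeaway is only that a locale-free mapping collapses names a Turkish user would consider distinct, and can change string length, which is part of why doing this inside the filesystem gets messy.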

Regarding Unicode: NFD is a normalization form; it converts 'é' U+00E9 and 'é' U+0065 U+0301--which are semantically identical--into the same coding. As it happens NFD picks U+0065 U+0301 for that string; NFC picks U+00E9. Any time there is ambiguity, NF[CD] will retain the original ordering. Calling it "destroying user data" is meaningless histrionics. Most of the time we tend to use NFC. I am told that NFD has certain advantages for sorting, where one might want to match the French word 'être' with the search string 'etre'; in NFC this requires large equivalence tables but in NFD the root character is the same in both cases. Linus's claim 'Even the people who think normalization is a good thing admit that NFD is a bad format, and certainly not for data exchange.' has a big [citation needed] tag attached.
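
For anyone who wants to poke at this, a minimal Python sketch using the standard unicodedata module. The strip_marks helper is a hypothetical name of mine, and dropping combining marks is just one simple way to get the 'être' vs 'etre' match described above, not a claim about how any real search or sort implementation works.

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both spellings converge on one coding under either normalization form.
nfc = [unicodedata.normalize("NFC", s) for s in (precomposed, decomposed)]
nfd = [unicodedata.normalize("NFD", s) for s in (precomposed, decomposed)]

def strip_marks(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )

print(precomposed == decomposed)      # the raw strings differ
print(nfc[0] == nfc[1], nfd[0] == nfd[1])
print(strip_marks("\u00eatre"))       # 'être' with the accent removed
```

Once the string is in NFD the base letter is sitting there as its own code point, so the accent-insensitive match needs no per-character equivalence table.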

As it happens, my personal belief is the following: Given that users expect case insensitivity and locale-specific ordering, which complicate filesystem design tremendously. Given that users mostly interact with the system through GUI dialogs, which already hide system files (files with the hidden bit in HFS+, or files starting with '.'). Therefore, extract the case insensitivity to a layer, used by the GUI, which can understand the user's locale and so fold case properly. This layer should be available to command line applications so that they can use the same rules if they so choose. The underlying filesystem will then be case sensitive, but is still used to encode Unicode data; the right thing to do here is to normalize. Either NFC or NFD is fine, really.

For pedants: the related NFKC/NFKD forms add a compatibility-mapping step and are absolutely not semantically safe in any way, for all that they're useful for sorting.
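
A quick Python illustration of why the K forms are not safe for stored names. This is a sketch with two sample characters I picked; loses_distinction is a hypothetical helper name.

```python
import unicodedata

def loses_distinction(text: str) -> bool:
    """Hypothetical helper: True if NFKC rewrites text that NFC leaves alone."""
    canonical = unicodedata.normalize("NFC", text)
    compat = unicodedata.normalize("NFKC", text)
    return canonical == text and compat != text

ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
superscript = "x\u00b2"  # 'x' followed by SUPERSCRIPT TWO

print(repr(unicodedata.normalize("NFKC", ligature)))
print(repr(unicodedata.normalize("NFKC", superscript)))
print(loses_distinction(ligature), loses_distinction(superscript))
```

Under NFKC a file named with a superscript two and one named with a plain 2 become the same string, which is fine for a search index and not fine for the name you hand back to the user.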

And then, you receive a zip file from Linus that has file.c, File.c and FILE.c files on it. You extract it, and then? They either end up on disk, breaking your case-insensitive UI layer (yes, you can see those files, but can you copy them elsewhere?), or they don't, breaking the makefile that's also in the archive. Here be dragons.
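
If you want to know whether an archive will bite you before extracting it, something like this Python sketch would do. It assumes Python's default casefold() is close enough to what the target volume does, which is not exactly true for HFS+ (it has its own folding table); case_collisions and the sample listing are made up for illustration.

```python
from collections import defaultdict

def case_collisions(names):
    """Group names that differ only by case (per Python's default casefold)."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Keep only the groups where more than one spelling maps to the same key.
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical archive listing; a real one could come from zipfile.ZipFile.namelist().
archive_listing = ["file.c", "File.c", "FILE.c", "Makefile", "README"]

print(case_collisions(archive_listing))
```

It only tells you that there is a problem. What to do about it (rename, refuse, overwrite) is exactly the part with the dragons.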

Locale-specific ordering of course _must_ be done outside the disk because disks may move between systems with different locales, locales can be changed at will, and multiple users could read the same directory with different active locales (well, must: one could store a locale for sorting per directory and force that on the user, but that is madness).

Also, reading http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.ht... (which I can't find anymore on Apple.com), HFS+ doesn't quite use full NFD because it sometimes destroys information that Apple deemed worth keeping.

Creating three files named file.c, File.c, and FILE.c seems like being user hostile for the sake of being user hostile. Even for people using a case sensitive filesystem, it's a dick move. Imagine talking about your project over the phone. "Yeah, the problem is in file.c. No, not File.c, file.c. Idiot! I clearly said file.c, not FILE.c!"

Yeah, I don't think that's a reasonable thing to do even if you can do it.

I've seen both Makefile and makefile in one archive.

Which is pretty ridiculous. Convention dictates that people use Makefile so that's the file people read and edit, yet the make utility reads makefile first by default. Including both is a great way to confuse users.

Yep. It definitely had that effect on me :)

In fact, the ATK [1] source tarball does have two files differing only in the first letter. One starts with lower case and the other with upper. I came across this while extracting the tarball on Windows - 7zip asked me mid-extract to confirm overwriting an existing file.

Thankfully the files are just (auto-generated) documentation files, so the loss of one file doesn't break the build or anything.

[1] https://developer.gnome.org/atk/

Yea, but it would be better to convert filenames from disk at comparison time.

The main reason all this tends to get pushed down to the FS layer is because of the question "what happens if the user accidentally creates two files with the 'same name' but different coding sequences". When it's the difference between README and readme I'm inclined to be like "well, that was dumb, move on with life" but it's different somehow for visually indistinguishable situations.
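
A rough Python sketch of the "convert at comparison time" idea from the parent comment: the stored name is never rewritten, and equivalence is decided only when a lookup happens. The names comparison_key and find_existing are hypothetical, and NFC as the comparison form is my assumption.

```python
import unicodedata

def comparison_key(name: str) -> str:
    """Key used only for comparing names; the stored name is left untouched."""
    return unicodedata.normalize("NFC", name)

def find_existing(stored_names, requested):
    """Return the stored name equivalent to `requested`, or None if absent."""
    wanted = comparison_key(requested)
    for stored in stored_names:
        if comparison_key(stored) == wanted:
            return stored
    return None

# One directory entry, stored in decomposed form exactly as it was written.
on_disk = ["re\u0301sume\u0301.txt"]

# A lookup using the precomposed spelling still finds the same entry.
match = find_existing(on_disk, "r\u00e9sum\u00e9.txt")
print(match is on_disk[0])
print(find_existing(on_disk, "resume.txt"))
```

A create path that runs the same check before adding an entry would refuse the visually identical duplicate without ever altering what the user typed.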

1) What are the odds Tim Cook or Craig Federighi will hear about this?

2) What are good filesystems for OS X to adopt? OS X supports other filesystems. Is there a way to force it to install the OS onto a different filesystem, like ext4?

Re: 2), there are significant parts of OS X that seem to either check for HFS+ or rely on its implementation and bugs, and those things don't work properly on other filesystems. OpenZFS, for instance, still doesn't work with Spotlight, on which a surprising number of things depend these days.


If you want to use all the software features of your Mac, your only option is HFS+.

Soon it will support Spotlight. See https://openzfsonosx.org/wiki/Changelog#1.3.1-RC5

OS X doesn't have full native support for any filesystem other than HFS+.

It supported UFS up until 10.9, but that was ancient and awful.

It can read and write FAT, but that's even worse. (It doesn't even support permissions, so it couldn't possibly boot from a FAT volume.)

It can read and write FATX. I don't think that supports permissions either, though, and it's also annoyingly patent-encumbered.

It can read NTFS, but not write to it.

Wasn't NTFS-3G ported to OS X?

ntfs-3g was indeed ported and can be easily installed, but I think the point was that writing to NTFS is not natively supported.

I agree with Linus from a technical point of view but I think Apple had many considerations here.

A number of games (and possibly other programs) on the App Store alone specifically mention that they will not work on Macs configured with case-sensitive file systems. My guess is that this aids programmers who may have ported something from Windows and not tested all possible file/path dependencies.

This may also help users when copying files from Windows network disks or Mac legacy systems where (from their point of view) they expect things to work.

BTW, one other stupidity is that Linux's HFS+ implementation refuses to mount journaled volumes when Apple designed it to be backward compatible.

That's not correct. It won't mount journaled HFS+ volumes rw; they are mounted ro by default. You can force them to mount rw, and it'll ignore the journal, possibly at the immediate or near-future peril of the filesystem, depending on what state it's in. I don't know why Linux is so far behind on supporting it, and that doesn't bode well for supporting Core Storage volumes. Near as I can tell, while HFS+/X are part of Darwin and thus at least sorta open sourced, I'm not finding any of the Core Storage stuff open sourced, meaning it'd have to be completely reverse engineered.

How could that possibly work? I presume by "backward compatible" you mean that the data outside the journal remains consistent, and the journal layer is capable of detecting modifications made by non-journaled mounts.

That's... fine, I guess. It prevents the obvious corruption cases. But the only plausible recovery mechanism after such a mount is to throw out the journal! That's not likely to be acceptable to most users ("I booted to linux and back, and now a bunch of new files disappeared!").

That's stretching the meaning of "compatible" too far.

Pretty much. Look for lastModifiedVersion in: http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.ht...

Then I fail to see how this is a stupidity. Linux is doing the right thing: there's no way for it to mount the filesystem without damage.

I agree that HFS+ and its API are outdated.

Case insensitivity I find useful, OTOH.



Thank God we have someone like Linus to keep BIG TECH in line. It's not all about marketing Apple, Microsoft. It's about good design, and openness.

I would expect far more downvotes than that! Come on what's wrong with you "Hacker News". Only 1 downvote when I'm slammin' Apple for their shitty strategy? Surely some of you have at least some downvotes available to use against that!!!

Even after a good night's sleep, I'm happy I decided to post this comment. The people hiding behind their karma scores to downvote perfectly legitimate and contextually relevant comments are one of the most frustrating and annoying things about Hacker News.
