I have to point out that case sensitivity offloads a bunch of that complexity to the user. This is almost definitely why OS X uses case-insensitive HFS+ by default.
As an ultra-simple example: with a case-sensitive file system I can have two directories, "documents" and "Documents". Which one are my documents in? Half in one, half in the other, probably.
I'm not saying Linus et al. are wrong that case-sensitive is the way to go, but there are some reasonable arguments for trying to take that complexity back from the user and deal with it in the file system.
Source: I'm a long-time Mac user, and I just switched to a case-sensitive file system last year. The win of having my dev machine match my deployment environment (iOS) is bigger than any downsides I've seen yet.
But, I believe all that proves is that in many cases, "what I'm used to" is the same as "easy and intuitive". There is often no one true way to make something complex easy and intuitive, and sometimes the "easy way" is the way you're used to (but for someone used to another way, it won't be easy).
In this case, in the absence of compelling evidence that case-insensitivity is a net win for the user, the simpler implementation is probably the right one. The number of bugs over the years caused by case insensitivity, Unicode mapping, and the like is probably further evidence favoring simplicity.
Also, the average Mac user doesn't even know they have a command line (seriously, I can't remember ever saying "Open Terminal" to a casual Mac user and having them say anything other than "I don't know what that is" or "I don't have that"). There's no reason a higher layer can't make case-insensitive decisions for the user (say, searching for "the clash" will find music by "The Clash"), which doesn't require the freaking filesystem to make those kinds of guesses. Having this happen at the filesystem layer has always seemed utterly mad, to me. And having it happen at the higher layers is how Linux and UNIX software has always handled it. Somehow we muddle through with the filesystem being entirely naive about the letter "a" being sort of the same as "A" (for humans).
I think it is very reasonable to say that ordinary users of the Latin alphabet expect "Documents" and "documents" to be the same word. I think it is even safe to assume most readers consider "john" and "John" the same name, while "Jon" and "John" are not.
I can't think of any situation where I have needed to have two differently capitalized files reside in the same folder, but I can remember dozens of times I've been annoyed by it.
My point is, there are arguments for calling case sensitivity user-hostile, but you don't present any to counter them.
I'm not sure that having such a fundamental thing as whether two file names are the same be dependent on such complex things as unicode, canonical forms, etc. is a good idea.
I won't comment on whether it's more user-friendly or hostile -- I have no data to back that up anyway. It's ridiculously complex and that's what I would base my decision on. At least using "a string of bytes" as the file name is simple conceptually and pretty hard to get wrong.
In reality, there is a sliding scale between usefulness and complexity in filenames, and where one person thinks the optimal trade-off point is may not match what others think. Many filesystems have chosen (rightly so, in my opinion) to just allow as much as possible and let convention and the application layer sort it out. There are many, many complexities to character normalization and casing (think Unicode and languages where multiple characters may be combined into a single separate character, or where there are choices to make about the correct way to change case based on the word itself), and again the problem is where to draw the line on what the filesystem should do for you.
The real problem is that people think that filesystems let us name files with words, when really we are just tagging files with characters. Words, as used in language, are an entirely more complex beast, and IMHO entirely the wrong thing for a filesystem to care about.
This is, of course, a mild bit of user hostility, but it is no less friendly and no harder to use, to me, than case insensitivity. As I noted above, it's mostly a matter of what you're used to, but in the absence of a reasonable case that case insensitivity makes life easier for users, the right implementation is the simpler one.
The more dangerous stuff is where the OS decides that one file is actually a different file, through character conversion or through Unicode mapping. And, that, (not) coincidentally, is where serious security bugs were found in git. And, git is not alone in running into these problems.
In short, I find it annoying that the OS would decide it knows what I want, when I ask for something else. But, more importantly, it has been proven occasionally to be dangerous, and often to be error-prone, in implementation. But, on that front, I'm probably just going to repeat what Linus said less effectively.
This is a source of confusion, but filesystem convention is not the place to solve it, because it can't solve it. I can still have three directories for bug #12: "ticket_12", "ticket12", and "ticket 12". Which one is my investigation in? Is documentation in "doc" or "docs"? Etc.
Perhaps one could calculate some sort of visual distance between filenames and not allow any two in the same folder, if their distance were less than some constant.
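A toy sketch of that idea, assuming Python, with difflib's stdlib similarity ratio standing in as a crude "visual distance" (the helper, the sample names, and the 0.8 threshold are all made up for illustration):

    import difflib

    def too_similar(a: str, b: str, threshold: float = 0.8) -> bool:
        # Crude "visual distance": similarity ratio over the lowercased names.
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    # A create() would then refuse names too close to an existing sibling:
    existing = ["ticket_12", "Documents"]
    for candidate in ["ticket 12", "documents", "README"]:
        clash = any(too_similar(candidate, name) for name in existing)
        print(candidate, "rejected" if clash else "ok")

Of course, picking that constant is itself a taste- and locale-dependent judgment call, which is rather the point of the thread.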
Most of the time, having both a `foo` and a `Foo` identifier at the same time is a bad idea: it's hard to remember which is which. The only time it makes sense is when it's backed up by strong conventions: say, `foo` is a method and `Foo` is a class. With these conventions, you're not really remembering `foo` and `Foo`, you're remembering "`foo` method" and "`foo` class". In a sense, it's just a hack to have sigils use an otherwise unused part of the identifier, just like tagged pointers use the unused bits of a pointer for metadata.
Using capitalization for sigils like this is fine, and works well in practice, but you could as easily use some other symbols in the same role. At the very least, it helps to have the conventions built right into your language (like Haskell) rather than just followed manually (like Java). Moreover, having significant capitalization outside the conventions (i.e. in the rest of the word) will still just be confusing: was it `subString` or `substring`?
I don't know what the best solution is, but case sensitivity everywhere isn't it.
That only really happens if you're an idiot who just haphazardly throws documents into any directory that looks like it can still hold files. And it's not an indictment of case sensitivity; you can achieve the same sort of stupidity in plenty of other ways if you're determined to do so.
What's your actual solution?
See my other comment, whereby the Darwin team told me they use a case-insensitive FS because Microsoft Office internally converts filenames to uppercase or lowercase at its leisure, often several (if not tens of) times during an operation.
I could be wrong, the person / people I talked to could be incorrect, but I've never heard another explanation that was not speculation.
Your reasoning all applies reasonably well to why Microsoft decided to buck the trend and go case insensitive, since case sensitive was the norm up until they came along afaik.
Here's a sample:
And they answered, flat-out: "Microsoft Office"
Even this past week, I was talking with coworkers about having trouble with nothing but Valve's Steam on my Macs that have a case-sensitive filesystem. That's particularly odd, since it works on Linux now, but that's another matter.
What I found most notable about this thread, is this quote from Linus:
"And Apple let these monkeys work on their filesystem? Seriously?"
I'm pretty sure Apple actually _fired_ anyone who wanted any of the things done anything close to any way that Linus Torvalds would agree with.
ext* not being particularly perfect, I'm happy to have both. I mean, ext2 is hard to complain about, but it comes from an era where basically all filesystems were terrible, literally the era when SGI started installing backup batteries to race with fsync().
ext4 has an alarming number of corruption bugs, but I'm sure it's not because of insane unicode handling, though I take Linus' description of how the OSX filesystem works with a grain of salt. He can't possibly _care_ to know as much about it as he knows about Linux's.
(Achievable by formatting as HFS+X in Disk Utility in Recovery Mode, then installing onto that drive.)
Which bugs are these?
> though I take Linus' description of how the OSX filesystem works with a grain of salt. He can't possibly _care_ to know as much about it as he knows about Linux's.
My favorite is, at least a couple years back, if you had a KVM guest with an ext4 filesystem in an image file, on an ext4 filesystem, the guest OS could corrupt the host OS.
Obviously, ext4 is largely a very good FS, but at least twenty people I've managed hundreds of servers with agree that if you do not need huge filesystems, huge files, or directories with tons of files, ext3 is a safer choice.
I'm not saying Linus is incompetent, just that his criticism of other filesystems shouldn't be processed as if his work has no faults.
One thing I wonder is: is it the filesystem or the C library that determines things like how Unicode is interpreted in paths? That's the overwhelming focus of his rant.
Your "alarming number of corruption bugs" is one bug from a couple of years back?
ext4 is fairly well established now. A few years ago it might have been new enough that there were edge cases that needed investigation, but it's robust enough now.
And there's a difference between a decent design having some bugs that can be fixed and a fundamentally broken design. Linus seems to be arguing that HFS+ is the latter.
The problem with HFS+ is precisely that it needs to handle Unicode in the filesystem, because it needs to consider different names as the same. And, seemingly, it does it even worse than NTFS does.
Most UNIX and Linux systems seem to have an "all lowercase" or "all uppercase" convention, so the fact that they have case sensitivity is often not utilised.
In fact the biggest reason you'd want case sensitivity off the top of my head is legacy support but that's just a circular argument (since you never really reach WHY it was that way originally, just that it was).
I guess, based on what he talks about next, he is worried about how case insensitivity interacts with other character sets (i.e. does it correctly change their case), but for most sets aren't the lower and upper case mappings defined in the Unicode spec itself?
Case-sensitivity is the easiest thing - you take a bytestring from userspace, you search for it exactly in the filesystem. Difficult to get wrong.
Case-insensitivity for ASCII is slightly more complex - thanks to the clever people who designed ASCII, you can convert lower-case to upper-case by clearing a single bit. You don't want to always clear that bit, or else you'd get weirdness like "`" being the lowercase form of "@", so there's a couple of corner-cases to check.
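Sketched in Python for concreteness (the corner case is exactly that '`'/'@' weirdness):

    def ascii_upper(byte: int) -> int:
        # 'a'..'z' and 'A'..'Z' differ only in bit 0x20, but the bit may
        # only be cleared inside the a-z range: clearing it blindly would
        # also map '`' (0x60) to '@' (0x40) and '{' to '['.
        if 0x61 <= byte <= 0x7A:   # 'a'..'z'
            return byte & ~0x20
        return byte

    assert ascii_upper(ord('a')) == ord('A')
    assert ascii_upper(ord('`')) == ord('`')   # unchanged, not '@'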
Case-insensitivity for Unicode is a giant mud-ball by comparison. There's no simple bit flip to apply, just a 66KB table of mappings you have to hard-code. And that's not all! Changing the case of a Unicode string can change its length (ß -> SS), sometimes lower -> upper -> lower is not a round-trip conversion (ß -> SS -> ss), and some case-folding rules depend on locale (in Turkish, uppercase LATIN SMALL LETTER I is LATIN CAPITAL LETTER I WITH DOT ABOVE, not LATIN CAPITAL LETTER I like it is in ASCII). Oh, and since Unicode requires that LATIN SMALL LETTER E + COMBINING ACUTE ACCENT be treated the same way as LATIN SMALL LETTER E WITH ACUTE, you also need to bring in the Unicode normalisation tables too. And keep them up-to-date with each new release of Unicode.
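All of this is easy to demonstrate from any language with passable Unicode support; here in Python (whose str methods use the default, non-Turkish rules):

    import unicodedata

    print("ß".upper())            # 'SS'  -- case mapping changed the length
    print("ß".upper().lower())    # 'ss'  -- and it doesn't round-trip

    nfc = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
    nfd = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    print(nfc == nfd)                                # False: different bytes
    print(unicodedata.normalize("NFD", nfc) == nfd)  # True: equal once normalised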
So the last thing a kernel developer wants is Unicode support in a filesystem.
There are as many Right Ways as there are people, of course, and no two are completely compatible, so if you want to Do The Right Thing, either go see a movie or write your application software to do it like that, because the OS kernel can only do it one way. And, of course, if you want applications to be able to do case folding, the filesystem needs to be case-sensitive so all the relevant information is preserved.
I agree with the rest, but that is one thing you absolutely shouldn't do. Once there are disks 'out there' that were created with some idea about what case insensitivity is, your choice has been set in stone. The risk of changing any rule is simply too high. Somebody might start reading that disk using the previous version of the file system.
For example, the precursor to HFS+, HFS, like HFS+, kept directories sorted by name. However, it had a filename sorting bug that the Finder had to work around (see http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.ht...)
And now you have a file system that is case sensitive, BUT ONLY IN SOME UNICODE RANGES. Somehow, that's better than just not being case sensitive in the first place?
I've used all-Linux systems for a while, and recently added a MacBook Pro Retina. I keep some personal directories of stuff backed up on multiple systems, along with a "manifest" of files and md5sums.
NFD conversion meant that file names, which included correctly encoded UTF-8 (NFC ...) accented characters, changed between writing them and an "ls" of the directory.
If I didn't already have experience with utf-16 and all sorts of unicode encoding and translation silliness, debugging this would have been impossible. The fix was to add this to my scripts:
| perl -M'Unicode::Normalize' -M'open qw(:std :utf8)' -ne 'print NFC($_);' |
man oh man was that hard to figure out though. Mac OS X is way way harder to make work right than linux. (See also disk utility)
Or is the problem specifically with NFC? I don't see how byte equality could be preserved under the other normalisations.
The strangeness of a normalizing filesystem is that you can try to open a filename (with accent, NFC), and it works. Then a readdir() doesn't list that exact filename byte sequence (you happen to get one that's extremely similar, but NFD).
On Linux it always does: there's no normalization, and exactly what you open is exactly what's listed. This really is an awfully nice simplification for this layer of abstraction. But you could argue that I'm just accustomed to the tradeoffs of it by this point.
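You can reproduce the mismatch without touching an HFS+ volume by applying the same NFD conversion HFS+ performs on write; a Python sketch:

    import unicodedata

    written = "caf\u00e9.txt"                        # the NFC name my script wrote
    listed  = unicodedata.normalize("NFD", written)  # what readdir() hands back

    print(written == listed)         # False: byte-for-byte they differ
    print(written.encode("utf-8"))   # b'caf\xc3\xa9.txt'
    print(listed.encode("utf-8"))    # b'cafe\xcc\x81.txt'

    # A manifest comparison has to normalize both sides first, which is
    # exactly what the perl filter above does:
    print(unicodedata.normalize("NFC", listed) == written)   # True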
To me it seems normalisation at least reduces the attack surface, while not solving the issue completely. I am not qualified to judge whether or not the complexity increase is worth it. But what are the downsides - what is so bad about NFC?
However, I'd add that what sucks even more is that this is not just a kernel issue. Every single application needs to worry about this.
In this particular instance, the vulnerability is caused by git not handling aliased names correctly.
Which means that anywhere you handle names, you must explicitly handle these aliases. Miss a spot (or an alias), and you have a security bug.
The dotless I, I ı, denotes the close back unrounded vowel sound (/ɯ/). Neither the upper nor the lower case version has a dot.
The dotted I, İ i, denotes the close front unrounded vowel sound (/i/). Both the upper and lower case versions have a dot.
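To see why this can't be done without a locale, note that Python's built-in case mapping applies the default (non-Turkish) rules:

    print("ı".upper())        # 'I'  -- dotless lower to dotless upper: fine
    print("i".upper())        # 'I'  -- but a Turkish user expects 'İ'
    print("İ".lower())        # 'i' plus U+0307 COMBINING DOT ABOVE
    print(len("İ".lower()))   # 2   -- lowercasing grew the string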
I can imagine that being a nightmare, particularly if one user uses a Turkish locale and another on the same machine uses English. And all of that complexity in the kernel? Ouch. (Or HFS+ just behaves incorrectly for Turkish file names, not sure that's really better?)
Additionally, given a utf8 string, OSX will translate it into another, possibly different utf8 string before using it as a filename (the NFD normalization that Linus mentioned).
Two strings differing in case is only one such example. Another example is the presence of codepoints which are ignored by the filesystem. The fix for git and Mercurial included the fix for this as well. Does Linus think all filesystems should allow every possible byte sequence as well? (He might, since that is the behavior of common Linux filesystems.)
The proper mindset is to realize the actual issue (multiple strings map to the same filename) and use a "filename_according_to_program -> filename_according_to_filesystem" function everywhere when dealing with filenames, not to blame the filesystem.
EDIT: That actually isn't even enough, because the answer may vary by locale.
Even worse, to solve this for Git's case, you need to know the answer:
1) For every locale
2) For every supported filesystem
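A minimal sketch of such a program-to-filesystem mapping for one target (a case-insensitive, normalizing HFS+), assuming Python. The ignorable-codepoint set here is purely illustrative, and real HFS+ uses its own decomposition tables frozen at an old Unicode version, so an actual implementation would have to match those exactly:

    import unicodedata

    IGNORABLE = {"\u200c", "\u200d", "\ufeff"}   # ZWNJ, ZWJ, BOM -- illustrative only

    def fs_key(name: str) -> str:
        name = "".join(ch for ch in name if ch not in IGNORABLE)
        name = unicodedata.normalize("NFD", name)   # HFS+ stores decomposed names
        return name.casefold()                      # fold case for comparison

    # Both spellings of '.git' and both encodings of 'café' collide:
    print(fs_key(".Git") == fs_key(".git"))             # True
    print(fs_key("caf\u00e9") == fs_key("cafe\u0301"))  # True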
On a legitimately case insensitive filesystem both "etc" and "ETC" would return identical filenames as they would be stored identically in the file system's database (e.g. would return "etc" for both).
So what you're really giving is not an example of why case insensitivity is bad, but why half measures are bad.
Where does it claim this? In any case this is not true because...
> but in fact stores everything in a case-sensitive way
This does not have to be true. NTFS can be case-sensitive or case-insensitive depending on which subsystem is accessing it. For example, while the POSIX subsystem for Windows was a thing, installing it had the side-effect of turning NTFS case-sensitive.
I wonder how this ties in with the whole Apple philosophy of "Design is how it works."
Clearly the innards look nothing like the facade.
Holy heck! How does that work in practice? Do operations get queued, with a single kernel process taking the lock to do the atomic updates?
I have to imagine that is going to cause a bottleneck however, as all non-read operations need to update the metadata (e.g. timestamp, maybe size if it is stored).
That all being said I haven't noticed OS X being particularly slower to do things than e.g. Windows. So if that is the case they're hiding it well.
OS X already has some pretty high-level file and metadata APIs not found on other systems, so maybe Apple's future plans don't look like a traditional Unix file system at all. They've already demonstrated they know how to make a very weird, non-Unix filesystem look like one. ;)
> "The true horrors of HFS+ are not in how it's not a great
> filesystem, but in how it's actively designed to be a bad
> filesystem by people who thought they had good ideas."
There doesn't seem to be a way to deep link to comments in G+?
Yeah, he's not a fan of HFS+ at all. Wasn't the plan to move to ZFS prior to the Oracle acquisition of Sun? Hopefully that ends up back on track somehow.
And even if they did build a new FS, there are some assumptions like case insensitivity that likely couldn't be abandoned without potentially breaking backwards compatibility.
Apple's Core Storage does have some COW properties, so they could probably do something like Linux LVM thin-provisioning-style snapshots. They could maybe add a tree for checksums, which would be a good idea for the journal at least; but seeing as this hasn't been a priority for even just the journal, the most unencumbered and recent add-on, it's obviously not a priority. The COW stuff was necessary to automagically migrate data from the SSD to the HDD in their so-called "Fusion Drive", so maybe they leverage that for an updated version of Time Machine, and could also do it for atomic online OS/software updates with an optional later reboot rather than a mandatory one.
One rather cool thing it does is on-the-fly encryption and decryption of an LV. You can even reboot in the middle of either conversion and it resumes where it left off. And those encrypted volumes can be resized (that's not well documented but is supported and is even used by their own Boot Camp Assistant).
So about the limit of the new FS is an LVM-like thing under journaled HFS+/X.
3rd-party: there's the non-FUSE port https://openzfsonosx.org/
Apple sells far more iOS devices than Macs these days, and the iOS filesystem is case-sensitive HFS+. It's hard to imagine Apple doing something new in filesystems that doesn't improve the platform they make the most money from.
In a way, the same logic applies to Macs. The number of Macs that support multiple internal disks will soon be zero if it isn't already, so some of the best features of ZFS are irrelevant on new Apple computers.
I just finished reinstalling Yosemite an hour ago after wiping the HD of a friend's new laptop. It would not allow me to get past the eligibility check until I logged in with a valid Apple ID.
For the vast majority of people this will never be an issue as the machine comes pre-installed and re-installing the OS is generally not required unless something major goes wrong.
"linux doesn't have to search parent directories for file-not-found, but I do."
Edit: further parsing reveals he implemented a (read-only) overlay system in his FS. Interesting, I wonder what the side-effects (vulns) could be?
Can anyone summarize why he thinks this?
Regarding case sensitivity: it is generally accepted among the user interface crowd that (Western) users don't really understand that 'C' and 'c' are different things; they're "both" 'c'. Case-preserving is thus the accepted practice. However, case manipulation is not an operation that can be done absent a locale; my go-to example here is that 'i' upcases to 'I' unless you're a Turk, in which case it upcases to 'İ'. Similar although not quite as bad is the fact that 'ß' upcases to 'ẞ' U+1E9E in some exotic circumstances; see http://en.wikipedia.org/wiki/Capital_ẞ for details. Similar limitations apply to sorting, which users also expect.
Regarding Unicode: NFD is a normalization format; it converts 'é' U+00E9 and 'é' U+0065 U+0301--which are semantically identical--into the same coding. As it happens NFD picks U+0065 U+0301 for that string; NFC picks U+00E9. Any time there is ambiguity, NF[CD] will retain the original ordering. Calling it "destroying user data" is meaningless histrionics. Most of the time we tend to use NFC. I am told that NFD has certain advantages for sorting, where one might want to match the French word 'être' with the search string 'etre'; in NFC this requires large equivalence tables, but in NFD the root character is the same in both cases. Linus's claim "Even the people who think normalization is a good thing admit that NFD is a bad format, and certainly not for data exchange" has a big [citation needed] tag.
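The être/etre point is easy to demonstrate: with NFD you can drop the combining marks and compare root characters directly (Python sketch):

    import unicodedata

    def strip_accents(s: str) -> str:
        # Decompose, then drop combining marks (category Mn).
        return "".join(ch for ch in unicodedata.normalize("NFD", s)
                       if unicodedata.category(ch) != "Mn")

    print(strip_accents("être"))                           # 'etre'
    print(strip_accents("être") == strip_accents("etre"))  # True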
As it happens, my personal belief is the following. Given that users expect case insensitivity and locale-specific ordering, which complicate filesystem design tremendously. Given that users mostly interact with the system through GUI dialogs, which already hide system files (files with the hidden bit in HFS+, or files starting with '.'). Therefore: extract the case folding to a layer, used by the GUI, which can understand the user's locale and so fold case properly. This layer should be available to command-line applications so that they can use the same rules if they so choose. The underlying filesystem will then be case-sensitive, but is still used to encode Unicode data; the right thing to do there is to normalize. Either NFC or NFD is fine.
For pedants: the related NFKC/NFKD forms add a compatibility-mapping step and are absolutely not semantically safe in any way, for all that they're useful for sorting.
Locale-specific ordering of course _must_ be done outside the disk, because disks may move between systems with different locales, locales can be changed at will, and multiple users could read the same directory with different active locales. (Well, "must": one could store a locale for sorting per directory and force that on the user, but that is madness.)
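For illustration, collation done above the filesystem in userspace, chosen per user (assuming Python and that the fr_FR.UTF-8 locale is installed on the system):

    import locale

    names = ["étage", "eau", "zèbre", "abri"]
    print(sorted(names))   # bytewise order puts 'étage' after 'zèbre'

    locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")   # assumed installed
    print(sorted(names, key=locale.strxfrm))   # ['abri', 'eau', 'étage', 'zèbre']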
Also, reading http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.ht... (which I can't find anymore on Apple.com), HFS+ doesn't quite use full NFD because it sometimes destroys information that Apple deemed worth keeping.
Thankfully the files are just (auto-generated) documentation files, so the loss of one file doesn't break the build or anything.
2) What are good filesystems for OS X to adopt? OS X supports other filesystems. Is there a way to force it to install the OS onto a different filesystem, like ext4?
If you want to use all the software features of your Mac, your only option is HFS+.
It supported UFS up until 10.9, but that was ancient and awful.
It can read and write FAT, but that's even worse. (It doesn't even support permissions, so it couldn't possibly boot from a FAT volume.)
It can read and write exFAT. I don't think that supports permissions either, though, and it's also annoyingly patent-encumbered.
It can read NTFS, but not write to it.
A number of games (and possibly other programs) on the App Store alone specifically mention that they will not work on Macs configured with case-sensitive file systems. My guess is that this aids programmers who may have ported something from Windows and not tested all possible file/path dependencies.
This may also help users when copying files from Windows network disks or Mac legacy systems where (from their point of view) they expect things to work.
That's... fine, I guess. It prevents the obvious corruption cases. But the only plausible recovery mechanism after such a mount is to throw out the journal! That's not likely to be acceptable to most users ("I booted to linux and back, and now a bunch of new files disappeared!").
That's stretching the meaning of "compatible" too far.
Case insensitivity I find useful, OTOH.