What’s New in Apple Filesystems [pdf] (devstreaming-cdn.apple.com)
198 points by sebiw 14 days ago | 187 comments




This is the key. The PDF fails to make sense in parts without the video.


Initially only found the PDF, then saw the video link at the end.


For anyone else wondering about WWDC19, all of these PDFs are accompanied by a video. I guess they were separate in-person workshops.


The read-only system volume really seems like a neat idea. It reminds me of the Transactional Server role in openSUSE where they make system upgrades as atomic as possible: https://news.opensuse.org/2018/05/15/transactional-updates-i...

That said, I don't understand the use case for firm links. Seems like the problem it's trying to solve can just be solved with simple mount points: mounting read-write /Users and /usr/local. Am I missing something?


From the video:

"Traditionally, it was done by mounting file systems on top of directories in the root file system. With the number of crossing points which we need to introduce and the number of volumes which it would require in the file system, that approach becomes rather expensive."


In the diagram, there are just two directories, and mount points could easily handle that case. In the video he describes how the scale of things makes using mount points here unwieldy.


A few more details that I've managed to suss out. The Data volume is mounted on /System/Volumes/Data. Currently, there are 21 firm links into it. They are enumerated in /usr/share/firmlinks (included below).

    /AppleInternal Device/AppleInternal
    /Applications Device/Applications
    /Library Device/Library
    /System/Library/Caches Device/System/Library/Caches
    /System/Library/Assets Device/System/Library/Assets
    /System/Library/PreinstalledAssets Device/System/Library/PreinstalledAssets
    /System/Library/AssetsV2 Device/System/Library/AssetsV2
    /System/Library/PreinstalledAssetsV2 Device/System/Library/PreinstalledAssetsV2
    /System/Library/CoreServices/CoreTypes.bundle/Contents/Library Device/System/Library/CoreServices/CoreTypes.bundle/Contents/Library
    /System/Library/Speech Device/System/Library/Speech
    /Users Users
    /Volumes Device/Volumes
    /cores Device/cores
    /home Device/home
    /mnt Device/mnt
    /opt Device/opt
    /private Device/private
    /sw Device/sw
    /usr/local Device/usr/local
    /usr/libexec/cups Device/usr/libexec/cups
    /usr/share/snmp Device/usr/share/snmp


It's interesting that /usr/X11 isn't in that list!


Well, X11 hasn't been provided by Apple since 10.8, about 7 years ago. So it wouldn't be touched by OS upgrades anyway (it's a separate product).


/sw isn't provided by Apple either, but they provide the firm link so that Fink will work.


On my older High Sierra box, /usr/X11 is a symlink to /opt/X11. Catalina ships with /usr/X11 symlinked to ../private/var/select/X11. So it should still be possible to get X11 to work.


> I don't understand the use case for firm links

Haven't watched the video yet, so I'm speculating. We know APFS doesn't support directory hard links, which means Time Machine can't back up to APFS. I'm guessing firm links are designed to fix this problem (perhaps among others).


I suspect they will eventually migrate time machine backups in one go to APFS+snapshots, then use the new ASR tool also mentioned in this talk to stream local time machine snapshots to the volume.

I also suspect that will launch 'soon' (after September, but possibly in March), but I have to imagine the complexity and testing requirements for such an in-place migration are staggering.


With APFS supporting proper snapshots, I don't think you need directory hard links any more.
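
For anyone who wants to poke at this, local APFS snapshots are already exposed through tmutil. A quick illustration, assuming the verbs introduced with High Sierra are still present:

    # create a local APFS snapshot of the volumes Time Machine covers
    sudo tmutil localsnapshot

    # list snapshots of the root volume
    tmutil listlocalsnapshots /

    # delete local snapshots for a given date stamp
    sudo tmutil deletelocalsnapshots 2019-06-08-120000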


Why are Linux file systems never really improving when there are so many potential developers? In the last 15 years I've only switched from ext3 to ext4, which isn't much of a change.

And the only alternative I hear recommended is XFS, which has also been around forever but isn't advanced; ZFS is license-restricted, and btrfs isn't widely used enough to consider using it daily.

It's sad that Linux is still stuck with basic file systems without any of the modern features, even though there must be huge demand for a more modern file system that is compatible with the kernel's GPLv2.


So: containers with a new bi-directional hard link, the "firm link" (intended to be invisible to applications, linking between volumes in an encrypted container), with read-only volumes protecting system software.

ZFS send/receive style volume serialization for replication backup, etc. File system snapshot support included.

iOS, or should I say iPadOS, supports external USB media and SMB.

What seems to be missing is block level checksumming and integrity checks, scrubbing, block level duplication to protect important files, etc. No whiff of deduplication either, but that might be too memory intensive for Apple's purposes.


My understanding is that deduplication has generally been not worth the effort for a general use filesystem. The cost in memory, cpu, and sheer complexity doesn't pay off in enough storage savings. Disk space is relatively cheap. Filesystem crashes/corruption are not.


Yeah, block level deduplication is not useful for files on a typical consumer laptop. I can imagine it's even less useful for files on iOS devices.

However it is often great for virtual machines and other software development related things, as long as all the underlying block sizes and alignments match.

You really do want strong integrity guarantees though. One corrupted block can cause a lot of damage.


The filesystem encryption used on iOS probably makes deduplication even less effective.


Depends on the layering order. You can do both dedup and encryption at the same time, if the encryption occurs on a lower layer.


Encryption in iOS is per-file. I'm not sure the OS really wants to go ahead and individually decrypt large numbers of files just to check if there are any duplicated blocks.


Not only that, but a server with 100 VMs is going to have a substantial percentage of each OS image identical... not so much on an average consumer laptop.


Disk space isn’t cheap on a Mac laptop.


Yeah, but that's a customer cost, not an Apple cost.

In any case, dedup just makes things more confusing to the customer— like, "you have X GB in files on your computer, but more than N-X GB free space because of magical deduplication." Or worse, "You deleted all those files, but didn't actually free up any space because they were dupes. So sad."


> In any case, dedup just makes things more confusing to the customer— like, "you have X GB in files on your computer, but more than N-X GB free space because of magical deduplication." Or worse, "You deleted all those files, but didn't actually free up any space because they were dupes. So sad."

For what it's worth, this can already happen in APFS. While they don't have real deduplication, if you copy a file within Finder, that copy won't take up extra space.
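
You can see the same behaviour from the command line: on APFS, cp -c asks for a clone via clonefile(2), so the copy shares blocks with the original until either side is modified. A small illustration (the file names are made up):

    # write a 1GiB file, then clone it -- the clone is created instantly
    # and initially consumes no additional data blocks
    dd if=/dev/zero of=big.bin bs=1m count=1024
    cp -c big.bin big-clone.bin

    # free space barely moves until one of the two copies is rewritten
    df -h .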


How granular is the COW? If you touch a byte, does just a block get copied or the whole file?


It depends on the application. Often applications do a kind of COW on their own: behind the scenes they write out a new file with a randomized naming scheme, and when that succeeds, do a rename/replace. So in this case you're getting a new file, as in, a file with a new inode.

If your application is smarter, then it could just COW one block, whatever the minimum block size is for that file system, e.g. for Btrfs it's page size, so on x86 that's 4KiB and on ppc64le it's 64KiB. I'm not sure what it is for APFS, I'd guess 4KiB. It could actually be inefficient depending on the use case, e.g. a 1MiB file with a single byte change in the middle, if it were a single 1MiB extent, becomes three extents with COW:

    0-500KiB - extent 1
    500-504KiB - hole where the stale data is
    504KiB-1MiB - extent 3
    4KiB - extent 2, with new data: a copy of 4095 bytes from the above hole, plus 1 byte of new data.
Edit: 2nd paragraph suggests the application would COW if it's smarter; that's not correct. If it's smarter, it would request an overwrite of some portion of the file and the file system is what would COW (if it's a COW file system). Granularity depends on a combination of application and file system.

Edit 2: More detailed answer on Btrfs. The whole file system on a zram device is about 260 lines using 'btrfs inspect-internal dump-tree' command. I'll excerpt just the file extents.

1. 'dd if=/dev/zero of=/mnt/test/bunchofzeros.txt bs=1M count=1' results in this:

    item 7 key (257 EXTENT_DATA 0) itemoff 3409 itemsize 53
    generation 7 type 1 (regular)
    extent data disk byte 5283840 nr 1048576
    extent data offset 0 nr 1048576 ram 1048576
    extent compression 0 (none)
2. Using vi to change one character and save as the same file, ':wq', I get:

    item 7 key (262 EXTENT_DATA 0) itemoff 3409 itemsize 53
    generation 9 type 1 (regular)
    extent data disk byte 7401472 nr 1052672
    extent data offset 0 nr 1052672 ram 1052672
    extent compression 0 (none)
So you can see it's a new file, new inode, and looks like length is changed too, maybe something to do with encoding.

3. Delete that file and create a new one with the same 'dd' command as in 1.

    item 7 key (266 EXTENT_DATA 0) itemoff 3409 itemsize 53
    generation 11 type 1 (regular)
    extent data disk byte 5283840 nr 1048576
    extent data offset 0 nr 1048576 ram 1048576
    extent compression 0 (none)

    [root@flap ~]# dd conv=notrunc if=/dev/urandom bs=1 seek=524290 count=1 of=/mnt/test/bunchofzeros.txt 
    [root@flap ~]# hexdump -C /mnt/test/bunchofzeros.txt | more
    00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    00080000  00 00 5d 00 00 00 00 00  00 00 00 00 00 00 00 00  |..].............|
    00080010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    00100000

    item 7 key (266 EXTENT_DATA 0) itemoff 3409 itemsize 53
    generation 11 type 1 (regular)
    extent data disk byte 5283840 nr 1048576
    extent data offset 0 nr 524288 ram 1048576
    extent compression 0 (none)
    item 8 key (266 EXTENT_DATA 524288) itemoff 3356 itemsize 53
    generation 14 type 1 (regular)
    extent data disk byte 5242880 nr 4096
    extent data offset 0 nr 4096 ram 4096
    extent compression 0 (none)
    item 9 key (266 EXTENT_DATA 528384) itemoff 3303 itemsize 53
    generation 11 type 1 (regular)
    extent data disk byte 5283840 nr 1048576
    extent data offset 528384 nr 520192 ram 1048576
    extent compression 0 (none)
Clever. Items 7 and 9 actually reference the same original 1MiB data extent, but use offsets to point to the unchanged portions, "creating" a 4KiB hole, and that hole is filled by item 8, a 4KiB extent containing the change. So it really did only COW one block.


Sorry, I can’t tell, are you still talking about APFS?


No, the example under Edit 2 is Btrfs. On the details I'd expect APFS to differ. But the gist is that how changes manifest on disk also depends on the application being used. On any file system, an edit in vi means a whole new file is written out, so it's effectively COW at the application level. Whereas an edit with an application that does byte-level overwrites would get translated into COW on a COW file system like Btrfs, APFS or ZFS. And I'd expect this COW operation to operate at the file system's minimum block size.


Just a block


Or even worse: "You no longer have any free space on your hard disk because you made a change to a large file that had a deduplicated copy."


Block level deduplication should limit the impact of this case.


Only if the change is localized to specific blocks.

Besides, block-level dedup is expensive to maintain. Unless you're expecting to find a lot of duplicates -- which probably isn't the case on most end-user devices -- it's a net loss.


Expect about 30%, most of it in smaller system files and libraries. Block-level dedup is great for pretty much anything where you have a consistent workflow.


A consistent workflow sounds like something you’re going to find more regularly on a server than a mobile device or home computer.


In my opinion storing, not deleting, is the most important job of a file system. Increased efficiency and free space is a better proposition than replicating the average user's mental model of free space in the volume.

(If my opinion holds true, then deduplicating is a problem when it comes to storage: a copy made to prevent bitrot/other damage to the original data doesn't work as such.)


> Increased efficiency and free space is a better proposition than replicating the average user's mental model of free space in the volume.

I'm not sure if I agree.

In an ideal world, it makes sense that performance should be paramount. In reality, I think it's often more important that computers do what we expect them to, so we can predict when they'll break and know how to resolve the problem when they do. This is all the more important in consumer software.

"My hard drive is out of space" is a very basic computer problem that most users will run into at some point. The intuitive solution is to delete stuff. If that solution doesn't work, it could be a major problem.


Storing vs deleting is one of the reasons that I'd love to see more research done on log based filesystems.


It's cheaper than the memory upgrade you need to support dedup properly.


Not really. ZFS dedup is just brain-dead stupid. Rabin-Karp chunking plus spatial locality / opportunistic dedup allow you to gracefully scale down: you keep a complete table of successfully deduped blocks, plus a small fraction of other blocks, like e.g. smallest 1% of hashes, first blocks in files and final blocks in files. If you have a 12 MB (100 blocks) stretch that matches, you have 63% (1-1/e) chance of discovering one of these blocks as a match, and then use spatial locality to dedup adjacent blocks (duplicate blocks tend to come in bunches).
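
For readers unfamiliar with the idea, here's a minimal sketch of content-defined chunking with a Rabin-Karp-style rolling hash, plus the "keep only the smallest ~1% of hashes" sampling described above. The constants and names are illustrative, not taken from any real dedup implementation:

    import hashlib

    WINDOW = 48               # bytes in the rolling window
    BOUNDARY = (1 << 13) - 1  # cut when the low 13 bits are all ones (~8KiB chunks)
    PRIME = 1000003
    MASK64 = (1 << 64) - 1    # keep the rolling hash in 64 bits
    SAMPLE = 100              # index roughly 1 in 100 chunks

    def chunks(data):
        """Yield content-defined chunks, cut where the rolling hash hits BOUNDARY."""
        h, start = 0, 0
        power = pow(PRIME, WINDOW, 1 << 64)  # coefficient of the byte leaving the window
        for i, byte in enumerate(data):
            h = (h * PRIME + byte) & MASK64
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * power) & MASK64
            if i - start >= WINDOW and (h & BOUNDARY) == BOUNDARY:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def sampled_index(data):
        """Index only chunks whose hash lands in the numerically smallest 1/SAMPLE."""
        index = {}
        for chunk in chunks(data):
            digest = hashlib.sha256(chunk).digest()
            if int.from_bytes(digest[:8], "big") < (1 << 64) // SAMPLE:
                index[digest] = len(chunk)   # a real system would store a location
        return index

On a hit against the sampled index you then walk forward and backward from the matching chunk to dedup the whole run, which is the spatial-locality part of the argument.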


Then why does turning on dedup on ZFS suddenly make it chomp through 4GB of memory? Is there some tuneable you're supposed to be setting to make it less insane?


They do RAM compression; I thought disk compression was soon to follow after Apple went all-flash storage.


APFS does indeed do compression, the rationale being that compute is cheaper than I/O.

Even if most files on a typical personal computer are already inherently compressed (JPEG photos, lossy audio, etc.), you're not losing any performance, and are gaining some extra space where possible.


Presumably not particularly valuable since most of what takes up consumer storage these days (pictures, videos, app assets) is already in some highly compressed format.


Relative to what? I've been around long enough it's hard for me to see just about any disk storage as expensive even if it costs more than whatever is on sale at Newegg.


Mac laptops aren’t upgradable without goofy adapters to convert an M.2 NVMe SSD to Apple’s form factor, at which point your case won’t close up without flexing the aluminum or the main board or both. For many MacBook Pro users, upgrading storage means buying a new laptop.

That’s expensive in my opinion.

OWC has made drives that fit Apple’s form factor (assuming your laptop is old enough) for a nice premium. Depending on which laptop you have, this can still be a science experiment.


Same with transparent compression in most cases.


It is a real bummer that there are still no content checksums available. One can work around this by doing sums and scrubs in userspace by storing mtime+hash in an xattr and periodically verifying and updating them. There is a utility to do this but I forgot the name.
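
A minimal sketch of that userspace approach, using the stock macOS xattr(1) tool to stash "mtime sha256" next to each file (the attribute name and layout are made up for illustration):

    import hashlib, os, subprocess, sys

    ATTR = "user.scrub"   # hypothetical xattr holding "<mtime> <sha256>"

    def file_hash(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    def scrub(path):
        mtime = str(int(os.stat(path).st_mtime))
        digest = file_hash(path)
        try:
            stored = subprocess.check_output(
                ["xattr", "-p", ATTR, path], text=True).split()
        except subprocess.CalledProcessError:
            stored = None   # first run: nothing recorded yet
        if stored and stored[0] == mtime and stored[1] != digest:
            print("possible bitrot:", path, file=sys.stderr)
        else:
            # file is new, legitimately modified, or verified OK: (re)record it
            subprocess.check_call(["xattr", "-w", ATTR, mtime + " " + digest, path])

    for root, _, files in os.walk(sys.argv[1]):
        for name in files:
            scrub(os.path.join(root, name))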


For a company which primarily sells consumer products it makes sense to omit hardware checksums: You just detect errors which the end user might never notice, a bit flip in a picture you never look at again might not be that bad.

For anyone who has ever heard of checksums this makes APFS a no-go of course.


I guess NTFS is also a "no go" for file storage then as well.

What about ext4? Oh wait - no file data checksums - so a "no go".

Actually, it turns out plenty of folks have built plenty of storage on these platforms - including APFS.

In particular, for APFS, apple uses ECC for all hardware, uses fletcher checksums for metadata and I believe does a copy on write approach to help handle sudden power loss events etc. It seems to be working out OK for them given the prices they charge.
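
For reference, Fletcher checksums are just two running sums, which is what makes them cheap enough to compute on every metadata block. A generic sketch over 32-bit words (not Apple's exact variant):

    def fletcher64(data: bytes) -> int:
        """Generic Fletcher-style checksum over little-endian 32-bit words."""
        data = data + b"\x00" * (-len(data) % 4)   # pad to a whole word
        lo = hi = 0
        for i in range(0, len(data), 4):
            lo = (lo + int.from_bytes(data[i:i + 4], "little")) % 0xFFFFFFFF
            hi = (hi + lo) % 0xFFFFFFFF
        return (hi << 32) | lo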


Just because you can't currently see any problems doesn't mean you're not surrounded by them. You can't see the problems because no one has turned on the lights.


Your point that other filesystems get by without checksums stands, but Apple does not use ECC for all hardware.

Also, we are left to wonder if their "reliable" SSDs are as reliable as checksums at the filesystem layer (or as reliable as both). I suspect the answer is yes, for the majority of applications running atop APFS (which is to say, iOS). Probably not for video assets on that Promise MPX bay (i.e. "external" non-Apple storage media).


I don't think ZFS can handle memory corruption without ECC either, so it's no better in this case.


I use removable storage. SD cards, USB flash drives, USB hard disks.

Apple’s hardware choices are irrelevant.


Those other file systems were designed a while ago.

If you're going to design one now, is there any reason to not have a checksuming capability? At least as an option, even though it may be default-off.


All the next gen, copy on write filesystems have checksums: NTFS+ (ReFS) has it, ext4+ (btrfs) has it, but HPFS++ (APFS) has not. Bummer.

At least optional checksums should be supported; then watchOS does not waste time with it, but a user with a $10k+ Mac workstation who might still use spinning rust for bulk data storage is SOL.


> NTFS+ (ReFS), ext4+ (btrfs)

That's.. an extremely confusing way to put it. ReFS and btrfs are from-scratch new FSes.


is ext4+ really shipping in volume? I checked RHEL, not the default.

Is NTFS+ the default on Windows 10 / Windows Server 2016 or 2019?

APFS is shipping today in quantity / as default and I believe they back migrated literally millions of devices to it something like a year ago.

I don't understand the user use case with a 10K workstation and no raid (ie, with parity etc) + SSD with internal error correction. So this is a $10K workstation with magnetic disks and no other error correcting option / layer? Why do this?


SLES (SUSE) is going all in with btrfs; it is no ZFS, but it's better than the equivalent RH solution of offering LVM or another block-level layer of checksums.

Not to mention all the ..BSD or ZfsOnLinux based storage solutions.

I am not denying that APFS is a good FS, but not even having the option of checksums given its contemporaries was clearly a political and financial decision to benefit Apple and not its users. Pointing to an opaque mix of raid (and indeed, raid10 is not even enough), super-reliable SSDs and the full stack control doesn't make the lack of checksums any better.


I don't really understand what the deal is with ReFS. Is this ever intended to reach normal consumers or are they happy with NTFS?


I’d think about moving back from ZFS to APFS if I had such a workaround. Block- or file-level integrity is now the only thing that keeps me back.


What are the use cases folks have with super high reliability needs?

I'm generally more worried about an SSD falling over entirely - so focus is on duplication (raid with hot spare -> offsite S3 backup).


From a user perspective it's effectively a zero cost feature. Apple should add it in.


OK - so they do a block level checksum (no error correction, but it would let you know that something went wrong). Now suppose a checksum check fails. Is it a "zero cost" feature to have every app updated to handle that checksum error? What is gained? If you need reliability, do something like an SSD or a RAID option and a backup.


The apps don't have to deal with it. It's a read error. Presumably something they already deal with. There's no operational difference for the end user between a block that can't be read and a block than can be read but with corrupted data. And yes, storage would ideally be paired with another physical device to provide redundancy and a mechanism to correct the error on the erroring device.


Or just pair it with iCloud and "cloud heal" the corrupted local data. Then mark the sector as bad and enjoy longer lifetime for the device.


I just built an external storage server on a supermicro 2U with 80TB of hdd and 2TB of fast flash running ZFS and connected to my Mac via 10GE. Best of both worlds.


https://github.com/ludios/bitscrub

This is one that does it with CRC32, which may or may not be sufficient for detecting bitrot. It is probably trivial to modify to support a stronger hash.


For internal storage Apple has control over the hardware. They can just choose hardware with better internal error correction/detection, so block level checksumming on the software side doesn't make much sense in this case. For external storage it would be nice, though.


You can't know how many uncorrectable errors you have if you're not checksumming. You can't see because there are no lights. And IMO it's just negligent to believe your flash memory is 100% perfect.


However, the flash memory knows the checksum at the hardware level, and runs its own internal checks for data integrity.

Flash controllers reserve a certain number of blocks for CRC and checksumming, so flash is much more intelligent and knowledgeable than it seems.


> However, the flash memory knows the checksum at the hardware level, and runs its own internal checks for data integrity.

Assuming that the flash controller has no bugs and doesn't fetch the wrong sector/LBA.

The OS asks for LBA X, but instead Y is fetched, and the sum(Y) passes so the data is sent back as "OK". If you don't have checksums at a layer above the hardware, how do you know if the right sector was sent to you? (I've seen reports of ZFS catching errors like this.)


Yes, that's a possibility. Flash level checksums only guarantee that the written or read LBA is intact.

However, I've never seen an enterprise storage system make a mistake like that. My archaic OCZ Vectors or Mac SSDs haven't done anything similar either. The problem you mention is as sinister as bit-rot.

It's always nice when the FS is self-aware about its integrity; however, it may be overkill for consumer class devices and installations. Maybe making the flash controller and the storage itself more reliable would be better in the long run for these types of devices, IMHO.


Flash level checksums also generally cover the flash translation layer (ie. the mapping of physical to logical sectors).

It's probably not needed for a vertically integrated stack of APFS on a Apple SSD controller.

It'd be nice to have for external drives though.


> It's probably not needed for a vertically integrated stack of APFS on a Apple SSD controller.

Apple code has no bugs that could send back bad data? Apple has access to hardware that does not experience bit rot? Apple's gear is not susceptible to firmware hacking?

* https://www.wired.com/2015/02/nsa-firmware-hacking/

The various old school SAN vendors have all sorts of control for storage controllers and disk firmware, and I've seen ZFS checksum errors appear on scrubs with them.


> Apple code has no bugs that could send back bad data? Apple has access to hardware that does not experience bit rot? Apple's gear is not susceptible to firmware hacking?

'It' being checksums over data that they already know is checksummed by the F2L layer and they know the semantics of.

If you're hacking the firmware, you know that APFS is running on top of it, so you would just fix up its checksums too. APFS having checksums doesn't get you anything there either.

> The various old school SAN vendors have all sorts of control for storage controllers and disk firmware, and I've seen ZFS checksum errors appear on scrubs with them.

You have next to zero introspection into how those drives actually work, or where the semantics of their F2L breaks down the easiest (they hide all of that out of patent concerns). For instance do those drives work around a write ahead log primarily, or is it a more traditional block renaming tree? How many generations are there? All the BS counters in the world are next to meaningless without some base implementation knowledge that's missing.


Say you have a bug that corrupts data, which somehow happens after writes so that filesystem level checksums can detect it, and you run that filesystem on a desktop/laptop/mobile device without redundancy. Does it actually help you with anything since you can't correct that error on the same filesystem level? Of course not. Does it at least prevent inconsistency? No, it doesn't, unless failed checksums stop the filesystem from doing new reads and writes. So the only sensible place for checksumming is at the level where such errors can be automatically corrected. Without redundancy that's just your disk controller, or applications designed with data recovery in mind that can also do checksumming - not filesystems, though.

Data doesn't fly straight from the CPU to the disk. It has to pass through RAM, the storage controller and, most of the time, some caches, before it reaches the medium that stores the data.

Data can be corrupted anywhere in this chain. A checksum or block can end up in a bad memory cell, your storage controller can bit-flip, or the universe can corrupt your data on the bus. Larger storage systems also add more connections, controllers, caches and disks, so the probability increases. This is why your flash, RAID controller, transmission protocols, HDDs, RAM and everything in between have error correction capabilities in enterprise systems. However, consumer systems lack some of these.

Also, not all data ends up on a cutting edge SSD. HDDs are still present and being produced. While HDDs also have some block level error correction, they are susceptible to bit rot.

At the end of the day, FS level checksumming is desirable, but it costs space and computation. Instead of reducing the runtime and performance of the computer, consumer devices rely on the recovery mechanisms of the hardware and prevent most of the problems most of the time with good-enough resiliency and reliability.


> Data can be corrupted anywhere in this chain.

Of course, but most of this kind of corruption is trivial to recover from with just retries, that's why checksumming there works, but not on the filesystem level, where it can only detect corruption. What makes a bit more sense is something like erasure coding. A filesystem with erasure coding would probably be useful for all the broken FTLs on SD cards, eMMCs and SSDs, but at the end of the day it wouldn't provide a significant enough level of resilience on a single disk to be worth the effort and the overhead. And checksumming alone still won't be useful.


> Of course, but most of this kind of corruption is trivial to recover from with just retries, that's why checksumming there works, but not on the filesystem level, where it can only detect corruption.

FS-level checksum saves you from the situation when the underlying storage thinks everything is okay but it is not:

> This means there was a data error on the drive. But it’s worse than a typical data error — this is an error that was not detected by the hardware.

* http://changelog.complete.org/archives/9769-silent-data-corr...

* https://research.cs.wisc.edu/wind/Publications/zfs-corruptio...

* https://news.ycombinator.com/item?id=13851349 (via)

It's not an either-or situation, but rather both-and hardware- and FS-level checks.


> FS-level checksum saves you from the situation when the underlying storage thinks everything is okay but it is not:

It doesn't save you from anything, it just turns bit flips into full blocks of data loss and introduces overhead. If your application can deal with data loss it can deal with bit flips too. Filesystem-level checksumming is still useless.


It tells you that your data is corrupted, which gives you options on dealing with the situation.

> If your application can deal with data loss it can deal with bit flips too.

No, it may not. The file may mostly be okay except for a place where a single bit flip causes the value to go from positive to negative:

* https://en.wikipedia.org/wiki/Signed_number_representations

Hardware not-handling bit flips is a documented phenomenon per my citations above. It does cause issues further up the stack. FS-level checksumming gives people the option to at least detect these errors, and depending on the situation, also perhaps fix them.
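
To make the sign-flip case concrete, here's a toy illustration of a single flipped bit in a stored 32-bit integer turning a positive value negative (the values are made up):

    import struct

    original = 1_000_000                      # a positive 32-bit value
    raw = bytearray(struct.pack("<i", original))
    raw[3] ^= 0x80                            # flip the most significant bit
    corrupted, = struct.unpack("<i", raw)
    print(original, "->", corrupted)          # 1000000 -> -2146483648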


People already have a much better option by doing checksumming, if necessary, at the application level - much closer to the source of the data, actually, with even fewer possibilities to miss corruption.

Yes, doing it at the application level would be better. But that doesn't preclude doing it at other layers as well.

It's why we have Ethernet checksums, IP checksums, TCP/UDP checksums, etc. Defense in depth.


We don't have IP checksums, and only have TCP and UDP checksums because it's optional on Ethernet, and TCP/UDP doesn't always run over Ethernet anyway.


> Of course, but most of this kind of corruption is trivial to recover from with just retries...

Retries won't help you when the data was corrupted on its way to be stored.


> Retries won't help you when the data was corrupted on its way to be stored.

Why not? This is exactly how such corruption is traditionally fixed. On a CRC error the data is rejected before it is stored, and a retry then fixes the problem. And it's not that big of a deal either; only applications can get data checksummed at the earliest possible moment anyway. The hard part is automatically recovering from data loss and corruption later, which also makes it unnecessary to even verify that data is stored without corruption.


> > Retries won't help you when the data was corrupted on its way to be stored.

> Why not?

1. Data is fine in memory

2. Data is fine over PCIe.

3. Data is corrupted by firmware bug in storage controller.

4. Data (now corrupted) is sent to disk, which generates checksums for each LBA, and sends the bits to the storage medium.

5. Some time later the data is requested, disk reads bits from sectors, checksums pass, corrupted data is sent back up the stack to the application.

This is not a fantasy scenario:

* http://www.cs.toronto.edu/~bianca/papers/fast08.pdf

* https://storagemojo.com/2007/09/19/cerns-data-corruption-res...


You are missing the point: checksums can't save you from data corruption and data loss. They are unnecessary without automatic recovery mechanisms and even make things worse.

> You are missing the point: checksums can't save you from data corruption and data loss.

No, I am not. The point is: checksums allow you to know that data corruption happened in the first place. Knowing this, you can then make informed decisions.

What saves you from "data corruption and data loss" is backups. But how do you know you have to go to backups if you're not aware that your data has problems?

> They are unnecessary without automatic recovery mechanisms and even make things worse.

Checksums allow you do know that there is a problem in the first place. If you don't know there's a problem, how can you invoke "recovery mechanisms", automatic or otherwise?

If you use (say) ZFS with some level of RAID-Z over multiple disks, then you get an "automatic recover mechanism". But even for a single-disk situation, it would be handy (for me at least), to know a file is bad so I can restore it from (say) my Time Machine backup.

But I have to know that the file is corrupted in the first place. And checksums would give me that.


> What saves you from "data corruption and data loss" is backups. But how do you know you have to go to backups if you're not aware that your data has problems?

It doesn't, actually. It introduces inconsistencies, unless on the first detected corruption your filesystem stops all reads and writes until you recover the corrupted block from backup - and you might not even have it in the backup, because backup is not synchronous replication.

> Checksums allow you do know that there is a problem in the first place. If you don't know there's a problem, how can you invoke "recovery mechanisms", automatic or otherwise?

You do your own appropriate checksumming, suitable for the recovery mechanism that you have. It's hardly as primitive as you think, because you have to respond fast, without corruption, transparently on request; you also have to schedule a check of whether you actually need to heal anything (maybe it was a transient error), then mark it as corrupted and find what to resync from where, all of which also has to be fast, which means somewhat complicated algorithms with checksum databases of sorts on each replica, etc.


What do you do as a living? Because I think we are living in different realities.

I, as a sysadmin, have been helped by (ZFS) checksums multiple times. This includes times where I had to go to tape manually because the really expensive SAN has said "everything is fine" when it wasn't.

So telling me "checksums do not help" is contrary to my life experience.


I mostly speak from experience designing, implementing and operating a distributed database/storage system, and traditional storage and databases before that.

And you've never seen the storage stack give back bad data?

You seem to be saying "checksums don't matter because there may be nothing we can do about the corrupted data". So we should just blindly accept all the bits from the lower levels of the stack? We should do nothing to verify that those levels of the stack are working correctly?

You've mentioned application-level checksumming: yes, that would be good. But until every single dev adds code to checksum (and correct?) their data structures and file formats, having more checksums allows for better information about when corruption occurs IMHO.


My point is checksumming is not a solution for data corruption, never was, never will be. It's only a tiny part of more complex algorithms and systems that detect and fix corrupted data automatically, never something usable on its own.

Checksumming is necessary, but not sufficient, for proper data integrity. But if you don't have it, you can't do anything else.

So why does it seem you are arguing against checksums?


> Does it actually help you with anything since you can't correct that error on the same filesystem level? Of course not.

Given that I run Time Machine, yes it does: it tells me I have to restore the file from a backup.

I also use SuperDuper to create bootable clones on a rotating set of drives that I take to work for offsite backup, so the file would probably be there as well. (Though this is a bit extreme for a non-geek to do.)


Flash level checksums are an integral part of wear-leveling and flash block allocation strategy. So they are also used for block migration/retirement controls. In other words, it's not trivial or desirable to remove flash level ECC or other checksums from physical flash management layer.

I cannot find the article now, but flash storage is way less reliable than it seems, and it's correcting bit errors silently, without reporting them, most of the time.

This article [0] shows some of the things behind the curtain in flash storage configuration and lifetime. I'd edit or add the other article if I can find it.

[0]: https://goughlui.com/2015/04/05/teardown-optimization-comsol...


I'm saying that flash level checksums on a vertically integrated stack make checksumming again at the FS layer moot, assuming you have proper control and reporting on both sides of the block device abstraction boundary. Apple's about the only one who can pull this off with their vertically integrated stack.

> ...flash level checksums on a vertically integrated stack make checksumming again at the FS layer moot...

This is simply not true. First, flash is not always the place where data ends up. Flash is present at more points in the storage chain - predominantly in RAID controllers' caches in the last 3-5 years, and in the hybrid drives used in laptops. So flash-level checksumming can guarantee that the data is retrieved in the form it was written, and allows neat tricks like wear-leveling and such. On the other hand, data can get corrupted in other places.

e.g. Data can get corrupted in RAM, bit-rot can happen in hard drives, RAID controllers can add a thick layer of complexity which you cannot manage, or the filesystem can just make mistakes due to bugs (I've seen it all). So a flash-level checksum can protect the data on the flash, but cannot guarantee that the data was correct in the first place. So if you checksummed the data and it got written differently, FS or RAID level checksumming will catch it. This will prevent the domino effects which the corrupted data can create.

Data management gets complicated fast when you need more than a few disks' aggregate capacity with high performance and reliability, and FS level checksumming is a powerful tool to ensure that.

> ...assuming you have proper control and reporting on both sides of the block device abstraction boundary.

Again this will only guarantee that the device reads the data it has written, but it won't guarantee that it had received the correct data in the first place.

> Apple's about the only one who can pull this off with their vertically integrated stack.

While Apple can do a lot of exotic stuff with the hardware, it won't be wise to create a custom flash controller and protocol to fuse FS and flash level checksums. It'll increase the software and integration cost, limit their options in hardware scene, slow down product development and plainly limit them in the long run.

I had an Intel 4-channel RAID card which corrupted the array whenever it woke up on the wrong side of the bed, and it spent hours verifying and rebuilding the array from the block level checksums it created on the disks during array creation. I never lost data, but decided that it wasn't worth the effort to keep the array alive and well, so I eventually ditched it. However, its activity LEDs on the board were kinda nice.


> While Apple can do a lot of exotic stuff with the hardware, it won't be wise to create a custom flash controller and protocol to fuse FS and flash level checksums. It'll increase the software and integration cost, limit their options in hardware scene, slow down product development and plainly limit them in the long run.

I mean, tell that to them, because that's what they've done. The T2 chip's feature set is very married to APFS. AFAIK, Linux still doesn't have working drivers for it despite it sort of looking like an NVMe drive.


Anyone can pull this off, because filesystem level checksumming without an additional place to recover data from is useless.

Exchange the checksumming with hamming codes or with another ECC system with >=2 bit detection/correction capability and you're all set.

Not really set, but even assuming you can make it work this is already way beyond checksumming the lack of which people complain about for some reason.

Most of the big storage vendors are making it work with erasure or Hamming codes. They compensate for damaged or missing disks in the arrays transparently and send you a small e-mail with the details of the disk you need to change.

RAID-5 and RAID-6 are the most primitive and probably most deployed versions of these checksum-based compensation and recovery methods.

Erasure/Hamming codes are used for bigger arrays, where RAID-5/6 cannot rebuild the disks in a reasonable amount of time while they are under I/O load.
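
As a toy illustration of the simplest of these schemes, RAID-5-style single parity: one parity block per stripe lets you rebuild any one lost block by XOR-ing the survivors; Hamming and erasure codes generalize this to multiple failures.

    def parity(blocks):
        """XOR data blocks together to form the parity block (RAID-5 style)."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    def rebuild(survivors, parity_block):
        """Reconstruct the single missing block from the survivors plus parity."""
        return parity(list(survivors) + [parity_block])

    stripe = [b"AAAA", b"BBBB", b"CCCC"]      # three data blocks in one stripe
    p = parity(stripe)
    assert rebuild([stripe[0], stripe[2]], p) == stripe[1]   # "disk 1" lost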


Pretty much every NAND flash device out there uses full ECC rather than just a checksum. The whole industry relies on it for yield reasons.

> However, I've never seen an enterprise storage system make a mistake like that.

Unless you're using a checksumming file system (a la ZFS), how would you know? And even if you were using ZFS, it would probably just do a re-read from another part of the mirror/RAID-Z and fix it behind the scenes.

That's the issue with people saying "things are fine": how do they actually know since we're not verifying things?

Up until fairly recently, we haven't really been looking too closely, simply trusting the lower levels of the stack:

* http://www.cs.toronto.edu/~bianca/papers/fast08.pdf


Their flash translation layer has block level checksums. Most of the time that's not enough, because the SSD vendor doesn't give you a lot of introspection into their failures or how their F2L works. But this is something you gain from vertical integration like Apple's.


I thought the storage itself had the integrity check.


I'm sure it has some special Apple magic, but I really think the hardware-only approach isn't the path forward. Hardware faults happen, and papering over that fact with un-debuggable, un-observable firmware is not a good choice.


iOS on iPhone also supports external USB media and SMB.


I was surprised to see this. From the original keynote, it seemed like an iPadOS-only option, but this video made it clear that it applies to both.


There's still a lot of effort to keep subsystems and application code-bases the same - iPad OS I believe is mostly a sign that there may continue to be UX deviation (widgets on home screen, drag and drop between applications, split view multitasking, etc)


Each supported device has always had its own image, so I assume it’s not exactly the same codebase.

The only reason I know this is because the last time I had to upgrade iOS using iTunes - going from iOS 4 to 5, I had three separate devices that required separate downloads for my iPhone 4, iPod Touch, and iPad.


And Samba shares, which from my perspective is even more interesting.


Minor correction. Will support USB media in iOS 13.


Thats probably very neat. Got rid of .DS_Store files yet?


And the .Trashes, which always manage to make my camera SD card full even after “deleting” everything. I’m not sure that there is a good solution to this other than to always use permanent deletion on removable volumes, like I think Windows does.


> And the .Trashes, which always manage to make my camera SD card full even after “deleting” everything.

The trash isn't magic. You delete stuff, it goes into the Trash. The Trash is represented by a ".Trashes" directory on removable media. You can delete it manually or just empty the trash.


My point is: Joe Average takes some photos with his camera, finds that he’s out of space on his SD card, pops it into his MacBook to transfer over a couple of good photos and delete the rest, and when he plugs it back into the camera it’s still full. Finder doesn’t show anything, and neither does the camera’s image review function. I don’t think that looking to empty the Trash is an intuitive next step, since it seems to me that the Trash wouldn’t be on the SD card.


> Finder doesn’t show anything

Finder shows a full "trash can" in the dock.


Joe Average uses the Photos app, or the Image Capture app, which moves the files and doesn't trash them in the first place.

Joe Average probably doesn't even know what a "file" is.


Maybe it could be, now, with snapshots.


.DS_Store isn't part of the file system per se, it's part of Finder's operation.


This is configurable, like Thumbs.db in Windows.


Still nothing on transparent compression? It's one of the main reasons why I keep a ZFS partition on the Macbook.


There is afsctool, git clone and build from https://github.com/RJVB/afsctool or brew install afsctool (an older version).


> Read-only state of the system volume can be disabled but not persistently, will revert to read-only after a reboot

Wait, what? Even with SIP off?

Can it be disabled from within the OS? Can I add "command-to-make-system-rw" to a launchd plist that runs at load on my machine?


I think having persistent access is the file system equivalent of logging in and working entirely from root.

The default should be a restricted system, with intentional, temporary jumps into unrestricted access for making whatever infrequent system configuration changes are needed.

What’s the use case for constant RW access? What changes at this level so frequently?


> I think having persistent access is the file system equivalent of logging in and working entirely from root.

Except that those system files are already protected behind root privileges, so they're only read-write if I have root. And, one reason the root privilege system works is because temporarily gaining root when necessary isn't painful.

I indeed don't change root files on a daily basis, but when I do, I don't want to have to go through an extended rigamarole every single time. Typing in my password should be enough.

I totally respect that Apple wants to keep normal users out of this stuff. I just want a way to remove the safety wheels one time, instead of again and again.


I've shot myself in the foot enough to appreciate an ever-lengthening list of hurdles I need to jump to do so again. I would be protesting if there were no way to make the rootfs read-write, but as it is I'm happy with having to perform a command or two when I want to write.

I suspect you can do that in a launchd script, but however much more involved it is than typing my password, I'm probably fine with leaving it as-is.
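
If anyone wants to try, a LaunchDaemon along these lines should do it -- a sketch only, assuming SIP is already disabled and that `mount -uw /` remains the command to remount the system volume read-write (the label and install path are made up):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
      "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
        <!-- hypothetical: install as /Library/LaunchDaemons/local.rwroot.plist -->
        <key>Label</key>
        <string>local.rwroot</string>
        <key>ProgramArguments</key>
        <array>
            <string>/sbin/mount</string>
            <string>-uw</string>
            <string>/</string>
        </array>
        <key>RunAtLoad</key>
        <true/>
    </dict>
    </plist>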


Disabling SIP, at least from Apple's POV does most of what you want. They've also implored people to be good UNIX citizens and make liberal use of /usr/local, which they explicitly give you - the user of the hardware - ownership of.


> They've also implored people to be good UNIX citizens and make liberal use of /usr/local, which they explicitly give you - the user of the hardware - ownership of.

/usr/local is root:wheel by default. macOS provides it to you as a place to put your software, but it does not give your user ownership of this directory.


> but it does not give your user ownership of this directory.

Admin users can sudo, and sudo should be used to install software. You're thinking of the shenanigans that Brew does, making `/usr/local/whatever` writable by anyone, without a password.

This is not how smart people install software.

Edit: wow, what a typo. One character completely reversed the intended meaning of the last sentence.


> This is now how smart people install software.

On the contrary, I think what Homebrew does with regards to permissions is fundamentally incorrect. /usr/local, being shared, should be owned by root, and it should require administrator permissions to install software there.


If I shared my Mac with anyone I’d agree, but as it is, my Mac is a single-user machine and I’d rather run brew as myuser:mygroup to avoid running the install scripts as root.


With Brew it's a moot point - you can't choose to do otherwise, it will refuse to run as root, and has no capability to do the "normal" `make && sudo make install` pair, where you build as user and install as root.


I'd rather not run a random `make install` (or, in Brew's case, a random ruby script) as root, though.


And that's what a user ~/bin directory is for then - if you don't want to run as root, install in your home.

Installing user-owned software in a user-specific location is fine.

/usr/local is not user specific, and setting the permissions so it is treated that way doesn't stop other software referring to it as part of the system-wide $PATH.


There is no ~/bin directory on macOS though. Homebrew could create one, but iirc Apple's guidelines discourage creating new directories in the user's home folder. (Which I generally agree is a bad practice, btw, although this might be an exception.)

I guess you could put something in the ~/Library folder, although that's not ideal either...


There's no `/usr/local/Brew` (or whatever directory it uses) either, but it creates it, and it changes the ownership of `/usr/local/bin` to make it writable without a sudo prompt - which is more egregious than creating a `~/bin` directory, any day of the week.

If they didn't insist on using `/usr/local/bin` (which is in the default $PATH) the permissions issue would be much less of an issue IMO (not a non-issue, but less of an issue than it is currently)


Gah, sorry, typo in my comment, it should say "not", but it says "now".

Anyway, I agree, it should be root owned and use sudo/similar to install.

Sorry for the confusion!


I should have been clearer about this^

On most systems with one user it's irrelevant, as by default they are in the admin group. Rather, /usr/local exists and you may read and write into it even if you don't have explicit ownership (e.g., root). Whether that's brew, or your own script, doesn't really matter.


On Mojave at least, `/usr/local` is owned by root:wheel, the first user account (which has admin rights) is not in the wheel group but can use sudo to write to `/usr/local`.

I happily make /usr/local owned by myuser:mygroup—on a single-user machine that’s no more dangerous than putting ~/bin into $PATH


If you edit your own $PATH, it literally only affects your user account.

`/usr/local/bin` is in the default $PATH, so making it world writable means anything malicious installed there may end up being called by a system utility that runs as another user than your own.

For your (and Brew's stated) goal, installing into ~/bin would be a better result, because it's literally that single user's account.

But the Brew developers never want to hear any criticism of their ridiculous bullshit, whether it's security issues or their half-assed dependency resolution.


rw file access doesn't mean there's a lack of file level permissions restricting a non sudo user from accessing an otherwise restricted file. case in point: try editing the /etc/hosts file without sudo. you can't save (without invoking sudo from within vim).

ro means no matter what I try, that file is ro unless the fs is remounted rw. Again, it doesn't negate file system level user/group permissions. and remounting a system level partition rw isn't something I'd do on a whim.

I like the idea of a ro partition for system level OS files, personally, but without user level access somehow - it's not something to be looked forward to.


> I like the idea of a ro partition for system level OS files, personally, but without user level access somehow - it's not something to be looked forward to.

Apple is clearly stating that the user will have access, so I'm not concerned on that front.

What bothers me is the "will revert to read-only after a reboot" part, which seems unnecessarily punitive. It's like Apple is saying "We're going to let you do this thing, but we don't like it, so we're going to make you redo it every single time to remind you of our disdain."

Elsewhere in this thread, Pwinnski said he likes having an extra, persistent safety check. I think that's great. But, an extra command line flag to make the setting persistent really shouldn't be too much to ask for.

If my launchd idea works in practice, this is all moot, and I retract all my complaints. I'm just nervous.


How often do you need to touch the system partition in the first place? Is this something you need to do every boot?


From the video, you can only enable read/write if you disable SIP in the first place. They didn't say anything about the second question.


Hopefully, while they are in there, they might do something to fix the nasty system lockups I have seen when starting both Time Machine and Carbon Copy Cloner backups since switching to APFS. I'm assuming it is related to having >5 million files, but haven't figured that out for sure.

The symptoms are the system becomes almost totally unresponsive for up to 5 minutes to the point where you can't even drag and drop an existing window sometimes.


What physical medium do you use (HDD, SSD)? I had a faulty HDD act up on me like that.


SSD on a 2016 MacBook Pro. The timing coincides exactly with the FS (and OS...) upgrade.

What percentage of that do you think is disk I/O vs lock contention?


I wouldn't expect that level of disk IO to cause issues dragging existing windows around without some deeper lock contention or something; it does not behave like it's just IO bound. Stuff being slower, totally, but not to this level. It was never an issue before the filesystem upgrade.


Anyone know how they made these slides? I'm speculating it's just Keynote with San Francisco Bold or something.


According to the file's meta information, the content was created using Keynote and then converted using macOS Quartz PDF.


Apple's public slides are always made in Keynote using an internal template for presentations.


I am left with the feeling that in 2019 the filesystem is the weak part of the operating system.

* mmap doesn't really work well on 64 bit systems

* mmap will block if it has to page data in from a file

* there are ten different backends to handle layering on Docker and that simple fact implies that none of them are good; if you could compose layers arbitrarily the speed and scalability of Docker would be in a different league than it is now, but you can't. So Docker is just as easily something that slows you down as speeds you up.

* Filesystem metadata scans are slow, shockingly so on Windows.

I know I like developing with S3-style object stores. To some extent this can replace the traditional filesystem, in other ways it can't.


> there are ten different backends to handle layering on Docker and that simple fact implies that none of them are good

overlay2 is fast, it's in the mainline kernel, and it's now the default for new Docker installs. I expect all the devicemapper / aufs mess to be history soon.
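
If you want to check or pin the backend on an existing install, a quick sketch (paths assume a standard Linux install):

    # show which storage backend the daemon is using
    docker info --format '{{.Driver}}'

    # pin overlay2 explicitly in /etc/docker/daemon.json, then restart:
    #   { "storage-driver": "overlay2" }
    sudo systemctl restart docker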


Any idea how this might affect (or help) projects like iSH, or similar?

I've been fighting to be productive on my iPad, and currently that takes the form of a remote terminal.

I've been craving the ability to run a proper Unix shell locally.


I highly doubt that you're going to get a proper shell on the iPad.


I'm assuming you mean a shell into the iPad's own Unix environment. You could build something for iOS that works like Crouton does on ChromeOS just fine. (Just, nobody has done it yet, for some reason. The iPad Pro is plenty powerful enough to run VMs!)


I don't think it's as easy as crouton. There's no chroot or user namespace equivalent for XNU.


iSH does this by emulating Linux rather than virtualizing it.


Both Pythonista and OpenTerm on iOS are "good enough" for a lot of what I need, but it would be nice to have a proper shell. Even if it's jailed/locked to /private/var/mobile


No effect at all. Apple is unlikely to make this possible, and given the sandboxing of apps on iOS, a meaningful version of this for local resources outside of the app is prohibited.


Agreed. There are some basic terminals available on iOS/iPadOS though, with ssh and other capabilities, but I can't speak to how productive one could really be with them. I wholly expect to see a version of Xcode released for the iPad within the next few years, though.


I use Blink[1] to Mosh to an EC2 instance where I use Jekyll, Vim, and Git to edit a static website that’s built and deployed when I push master. I can transfer images from my Photos library using the local shell, which I then resize on the VM.

I’ve heard people speculating about “Xcode on iPad within a few years” since Swift Playgrounds was released. Hopefully it’s closer now, but I’m still skeptical.

[1] https://blink.sh


On a similar note, I hope an iOS (/iPadOS) app is in the works for https://code.visualstudio.com/docs/remote/remote-overview


iSH is not affected, since it currently only touches files inside of its sandbox.


nothing to address the docker mounted-volume slowness issue?


It is very unlikely that docker issues are a priority for Apple.


Have you tried using delegated mounts? I got the impression these helped performance, though I have not measured details.
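
For anyone who hasn't tried them, the consistency flags are set per bind mount on Docker Desktop for Mac; a small illustration with made-up paths and image name:

    # relax host/container consistency for the mounted source tree
    docker run -v "$PWD":/app:delegated my-image

    # or the docker-compose.yml equivalent:
    #   volumes:
    #     - .:/app:delegated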


Can you back up to an APFS volume yet, or does it need to be HFS+?


I was really surprised when APFS came out that it didn't support time machine. From the looks of this, it appears they're going to be able to write a new-style time machine on-top of APFS with the additions here. Interesting, over-due IMHO.


I suspect they are waiting until they can migrate a Time Machine volume in one go (e.g. migrate the filesystem from HFS+ with directory hard links to snapshots, and a potential non-backup volume for other files).

They have a lot of deep time machine integration in other parts of the system, so I suspect the staging of the projects means it is pushed off to sometime next year.


Hearing about how snapshots work over the years during APFS sessions made me immediately think of 2007/2008 when they were describing the tech that eventually became Time Machine, but it wasn't a one click operation yet.

As a perverse experiment, do we know what happens if you take an APFS SSD and try to format it as a TM volume? Guessing it's still not pretty despite the fact that HDD's under APFS are PNG, but directory hardlinks are gone.

Does it work, but just fill up the disk - no deduping?


Time Machine refuses to back up to APFS volumes at present.


Mods: Can someone add [pdf] suffix to this please?


Added. Thank you!


[flagged]


Please don't post unsubstantive comments here.


[flagged]


There's probably a good point there but to be within the site guidelines you need to express it substantively and without snark. This comment is better than the GP that way.

https://news.ycombinator.com/newsguidelines.html


Aren't PDFs automatically marked anymore?


Title got renamed



