
APFS in Detail - knweiss
http://dtrace.org/blogs/ahl/2016/06/19/apfs-part1/
======
veidr
What a great and valuable post, especially since this info is the result of
talking to the APFS team at WWDC, and has not been published anywhere else
yet.

Of particular interest (to me) was the "Checksums" section:

    
    
        Notably absent from the APFS intro talk was any mention of
        checksums....APFS checksums its own metadata but not user data.
    
        ...The APFS engineers I talked to cited strong ECC protection
        within Apple storage devices. Both flash SSDs and magnetic media
        HDDs use redundant data to detect and correct errors. The
        engineers contend that Apple devices basically don’t return
        bogus data. 
    

That is utterly disappointing. SSDs have internal checksums, sure, but there
are so many other points in the path at which a bit can be flipped.

It's hard for me to imagine a worse starting point to conceive a new
filesystem than "let's assume our data storage devices are perfect, and never
have any faulty components or firmware bugs".

ZFS has a lot of features, but data integrity is _the_ feature.

I get that maybe a checksumming filesystem could conceivably be too
computationally expensive for the little jewelry-computers Apple is into these
days, but it's a terrible omission on something that is supposed to be the new
filesystem for macOS.

~~~
ahl
Talking to the Apple engineers, it really didn't seem to be an issue of
computation. They seemed genuine in their belief that they could solve data
integrity with device qualification. While I asked them 100 questions, they
asked me two: had I ever actually seen bit rot (yes), and what kind of drives
did we ship with the ZFS Storage Appliance (mostly 7200 RPM nearline drives).

~~~
ghshephard
That would suggest that APFS is only relevant for internal storage procured by
Apple. Do they not intend for it to be used on external storage?

~~~
kevincox
They mentioned that it would be used on removable media as well.

------
amluto
> I get that maybe a checksumming filesystem could conceivably be too
> computationally expensive for the little jewelry-computers Apple is into
> these days, but it's a terrible omission on something that is supposed to be
> the new filesystem for macOS.

Checksumming has another cost that isn't immediately obvious. Suppose you
write to a file and the writes are cached. Then the filesystem starts to flush
to disk. On a conventional filesystem, you can keep writing to the dirty page
while the disk DMAs data out of it. On a checksumming filesystem, you can't:
you have to compute the checksum and then write out data consistent with the
checksum. This means you either have to delay user code that tries to write,
or you have to copy the page, or you need hardware support for checksumming
while writing.

On Linux, this type of delay is called "stable pages", and it _destroys_
performance on some workloads on btrfs.
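
To make the ordering constraint concrete, here's a minimal user-space sketch
(not btrfs or APFS code; the toy checksum and file name are made up) of why a
checksumming filesystem has to keep a page stable between computing the
checksum and finishing the write:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    
    /* Toy checksum standing in for whatever the filesystem uses (crc32c, ...). */
    static uint64_t checksum(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint64_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum = sum * 131 + p[i];
        return sum;
    }
    
    int main(void)
    {
        char page[4096];
        memset(page, 'A', sizeof(page));
    
        int fd = open("block.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
    
        /* 1. Checksum the dirty page. */
        uint64_t sum = checksum(page, sizeof(page));
    
        /* 2. Write the page and its checksum. Between step 1 and the moment
         * this write completes, nobody may modify `page`, or the on-disk
         * checksum no longer matches the on-disk data. A non-checksumming
         * filesystem has no such constraint: an in-flight page that keeps
         * changing is merely stale, not self-inconsistent. This is the stall
         * Linux calls "stable pages". */
        if (pwrite(fd, page, sizeof(page), 0) != (ssize_t)sizeof(page) ||
            pwrite(fd, &sum, sizeof(sum), sizeof(page)) != (ssize_t)sizeof(sum))
            perror("pwrite");
    
        close(fd);
        return 0;
    }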

~~~
sangnoir
For desktop computing, I'll take data integrity over good 'performance' _any_
day. The use-cases for iDevices might be different, coloring Apple's
perspective.

------
gmac
Slightly worried by the vibe that comes off this. "I asked him about looking
for inspiration in other modern file systems ... he was aware of them, but
didn’t delve too deeply for fear, he said, of tainting himself". And (to
paraphrase): 'bit-rot? What's that?'.

I would have hoped that a new filesystem with such wide future adoption would
have come from a roomful of smart people with lots of experience of (for
example) contributing to various modern filesystems, understanding their
strengths and weaknesses, and dealing with data corruption issues in the
field. This doesn't come across that way at all.

~~~
cm3
Given Dominic's other output, I'm going to believe there's more to the story,
because he didn't strike me as someone who would actively ignore past
innovations. I know NIH is popular when devs believe they know enough to build
something themselves, but so much software is built poorly, without
consideration for existing designs, and it shows in what we have to live with.

------
sho_hn
I'm extremely confused by this:

> With APFS, if you copy a file within the same file system (or possibly the
> same container; more on this later), no data is actually duplicated. [...] I
> haven’t seen this offered in other file systems [...]

To my knowledge, this is what cp --reflink does on GNU/Linux on a supporting
filesystem, most notably btrfs, and newer combinations of the kernel and GNU
coreutils have been doing so by default.

This guy seems too well-informed and experienced in the domain to miss
something so obvious, though. So what am I missing?
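
For the curious, what cp --reflink asks the kernel to do boils down to the
FICLONE ioctl; a minimal Linux sketch (the file names are placeholders, and
the target filesystem has to support reflinks):

    /* Linux-only sketch: share extents between two files the way
     * `cp --reflink` does. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* FICLONE */
    #include <unistd.h>
    
    int main(void)
    {
        int src = open("original.dat", O_RDONLY);
        int dst = open("clone.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (src < 0 || dst < 0) { perror("open"); return 1; }
    
        /* Ask the filesystem (btrfs, OCFS2, ...) to let clone.dat share
         * original.dat's blocks; data is only copied lazily, when either
         * file is later modified (copy-on-write). */
        if (ioctl(dst, FICLONE, src) < 0)
            perror("FICLONE");  /* e.g. EOPNOTSUPP on ext4 */
    
        close(src);
        close(dst);
        return 0;
    }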

Also interesting to me is the paragraph about prioritizing certain I/O
requests to optimize interactive latency: on Linux this is done by the I/O
scheduler, which is exchangeable and agnostic to the filesystem. Perhaps
greater insight into the filesystem could aid I/O scheduling (though this has
also been the argument for moving RAID code into filesystems, which APFS opts
against) -- hearing a well-informed opinion on this point would be
interesting. Unless this post gets it wrong and I/O scheduling isn't
technically implemented in APFS either.

It _seems_ like this perspective might be one written from within a
Solaris/ZFS bubble and further hamstrung by macOS' closed-source development
model. Which is interesting in light of the Giampaolo quote about
intentionally not looking closely at the competition, either.

~~~
ahl
This guy (i.e. me) wasn't aware of this functionality in btrfs. Are reflinks
commonly used? Yes, I know more about ZFS than the other filesystems
mentioned.

------
idorosen
In my opinion, APFS does not seem to improve upon ZFS in several key areas
(compression, sending/receiving snapshots, dedup, etc.). Apple is
reimplementing many features already implemented in OpenZFS, btrfs (which
itself reimplemented a lot of ZFS features), BSD HAMMER, etc.

Maybe extending one of these existing filesystems to add any functionality
Apple needs on top of its existing features (and, hopefully, contributing that
back to the open source implementation) would cost more person-hours than
implementing APFS from scratch. Maybe not.

Either way, we will now have yet another filesystem to contend with and
(maybe) implement in non-Darwin kernels, which adds to the overall support
overhead of every operating system that wants to be compatible with Apple
devices. Since older versions of macOS (OS X) don't support APFS, only HFS+,
Apple and others will also have to continue supporting HFS+. It just seems
wasteful of everyone's time to me.

Also: [https://xkcd.com/927/](https://xkcd.com/927/)

~~~
mindajar
Classic comic, but I don't think it applies. APFS looks intended to solve
Apple's product problems really well, and it doesn't even try to be a
filesystem for everyone.

Apple has said from time to time that they're all about owning and controlling
the key technologies that go into their products. APFS makes a lot of sense
from that perspective, and this seems one of those cases where going their own
way is better than importing someone else's constraints. ZFS on an Apple
Watch? LOL.

~~~
tfar
I would not be surprised if one could write a ZFS implementation optimized for
more constrained devices. If you already know you are going to have flash
storage, you can probably ditch some of the N layers of cache you see in
common ZFS implementations. Not that ZFS is one size fits all, but the
filesystem's specification could be implemented in more than one way.

~~~
mindajar
If you assume Apple cares about having a disk format in common with other
platforms, sure, I'd agree that's probably possible. But I don't think they
do; they seem to care a lot more about things like a unified codebase across
their platforms, the energy-efficiency initiatives they've been pushing for a
few years, owning the key tech in the products, etc.

One slide in the WWDC talk deck showed a bunch of divergent Apple storage
technologies across all their platforms that are being replaced by APFS. If
ZFS has to fork into weird variants to run well on the phone or watch, that
seems less appealing than a single codebase optimized for just the stuff Apple
products do.

------
ghshephard
_For example, my 1TB SSD includes 1TB (2^30 = 1024^3) bytes of flash but only
reports 931GB of available space, sneakily matching the storage industry’s
self-serving definition of 1TB (1000^3 = 1 trillion bytes)._

Great article, but a couple of nitpicking corrections (which seem appropriate
for a storage article). Per
[https://en.wikipedia.org/wiki/Terabyte](https://en.wikipedia.org/wiki/Terabyte),
a terabyte is 1000^4 bytes, not 1000^3.

Also, it's been 6+ years since we all agreed that TiB means 2^40 (1024^4)
bytes and TB means 10^12. Indeed, _only_ in the case of memory does "T" ever
mean 2^40 anyway; in both data rates and storage, T has always meant 10^12.
The convention is strong enough that most of us have just thrown up our hands
and agreed that, when referring to DRAM, terabyte means 1024^4 bytes, and
1000^4 bytes everywhere else.

In the rare case where someone uses TiB to refer to a data rate, they are
almost without exception using it incorrectly and actually mean TB.
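
For what it's worth, the quoted "931GB" figure is exactly this unit mismatch;
a quick sketch of the arithmetic, assuming a drive sold as 1 TB = 10^12 bytes:

    #include <stdio.h>
    
    int main(void)
    {
        double tb  = 1e12;                 /* 1 TB  = 1000^4 bytes (SI) */
        double gib = 1024.0 * 1024 * 1024; /* 1 GiB = 2^30 bytes        */
        double tib = gib * 1024;           /* 1 TiB = 2^40 bytes        */
    
        printf("1 TB = %.1f GiB\n", tb / gib); /* ~931.3 -- the reported "931GB" */
        printf("1 TB = %.3f TiB\n", tb / tib); /* ~0.909                         */
        return 0;
    }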

------
TazeTSchnitzel
> Also, APFS removes the most common way of a user achieving local data
> redundancy: copying files. A copied file in APFS actually creates a
> lightweight clone with no duplicated data.

No, it doesn't. APFS supports copying files, if you want that. It's just that
the default in Finder is to make a “clone” (copy-on-write).

~~~
ahl
Fair enough; and right now cp doesn't use the fast clone functionality, but it
assuredly will. I'm not sure 'cat <file >file.dup' is reasonable for most
users.
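
If and when cp does grow that support on macOS, it will presumably sit on top
of the clonefile(2) call that arrived alongside APFS; a minimal sketch
(placeholder file names, needs macOS 10.12+ and an APFS volume):

    /* macOS-only sketch: create a copy-on-write clone of a file, roughly
     * what Finder's duplicate does on APFS. Fails (e.g. ENOTSUP) on HFS+. */
    #include <stdio.h>
    #include <sys/clonefile.h>
    
    int main(void)
    {
        /* The destination must not already exist; the clone shares the
         * source's data blocks until either file is modified. */
        if (clonefile("file", "file.dup", 0) != 0) {
            perror("clonefile");
            return 1;
        }
        return 0;
    }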

~~~
ghshephard
Just teach everyone to use dd instead - dd if=file of=file.dup :-)

------
cm3
I'm still looking for a widely supported (at least FreeBSD and Linux kernels)
filesystem for external drives to carry around that doesn't have the FAT32
limitations. There's exFAT, but no stable and supported implementation. Then
there's NTFS, but that's also not 100% reliable in my experience when used
through FUSE (NTFS-3G). I've considered UFS, but that also was a no-go. I'm
hopeful for lklfuse[1], which also runs on FreeBSD and gives access to ext4,
xfs, etc. in a Rump-like way, letting you use the same drivers on FreeBSD. I'm
cautious, though, given that I don't want corrupted data I might notice too
late. Let's see if lklfuse provides LUKS as well; otherwise Dragonfly's LUKS
implementation might need to be ported to FreeBSD or something like that.
External drives one might lose need to be encrypted.

[1] [https://www.freshports.org/sysutils/fusefs-lkl/](https://www.freshports.org/sysutils/fusefs-lkl/)

~~~
drvdevd
Thanks for this! I didn't realize (or had forgotten) LUKS had been ported to
Dragonfly. Also you touch upon my #1 frustration with APFS without really
knowing anything about it: simple portability.

~~~
cm3
Yeah, I believe it's by the same Dragonfly developer who also wrote tcplay[1]
for TrueCrypt volumes.

[1] [https://leaf.dragonflybsd.org/cgi/web-man?command=tcplay&sec...](https://leaf.dragonflybsd.org/cgi/web-man?command=tcplay&section=8)

------
niftich
The file-level deduplication [1] is interesting. I'm not a filesystem expert,
but this sounds like it fulfills a similar use case to snapshots [2]. Or am I
reading this wrong?

Is NTFS's shadow copy like snapshots?

[1] [http://dtrace.org/blogs/ahl/2016/06/19/apfs-part3/#apfs-clon...](http://dtrace.org/blogs/ahl/2016/06/19/apfs-part3/#apfs-clones)

[2] [http://dtrace.org/blogs/ahl/2016/06/19/apfs-part2/#apfs-snap...](http://dtrace.org/blogs/ahl/2016/06/19/apfs-part2/#apfs-snapshots)

~~~
rincebrain
NTFS Shadow Copies are more like LVM/ZFS snapshots than APFS's file-level CoW
clones, in that both operate on the entire volume at the block level rather
than with per-file granularity.

There are other FSes that allow the behavior that APFS is demonstrating - look
at OCFS2 and Btrfs, both of which allow you to do cp --reflink.

------
amelius
I think the value of this new proprietary filesystem is limited, since you
can't run it on servers (Apple does not make servers anymore). Also,
compatibility/porting issues may become a problem if you build your software
for it.

