
A ZFS developer’s analysis of Apple’s new APFS file system - rayiner
http://arstechnica.com/apple/2016/06/a-zfs-developers-analysis-of-the-good-and-bad-in-apples-new-apfs-file-system/
======
tnorgaard
I would recommend that anybody interested in filesystems watch Jeff Bonwick
(the ZFS inventor) explain the design of ZFS:
[https://www.youtube.com/watch?v=NRoUC9P1PmA](https://www.youtube.com/watch?v=NRoUC9P1PmA).
They share a few very nice war stories explaining why they found it useful to
have the user data checksummed as well.

~~~
espadrine
I might be wrong, but I read APFS's no-checksum decision as founded upon
Apple's hardware-and-software strategy. Given their situation, they can decide
that checksumming is a hardware problem. They can build their latest devices
with ECC, and require customers to only use Apple-developed USB keys. (I would
not be shocked to learn that, in late 2017, Apple starts selling USB-C storage
keys preformatted as APFS.)

~~~
TillE
Checksumming at that level is a bit pointless, because then you can't repair
the data. Instead of being able to recover from a mirror or parity data, all
you get is "it's corrupt, oh well".

~~~
ghshephard
Checksumming at the file system level solves the problem of corruption that
occurs off the media (on the bus). The media has checksumming on it that
allows detection and recovery from errors that occur on the media itself.
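The mechanism can be sketched in a few lines of Python (a toy illustration of checksum-on-write/verify-on-read, not how any real file system actually lays out its metadata):

```python
import hashlib


def write_block(f, data):
    # Compute the checksum while the data is still in memory, then store
    # it alongside the block; any bit flipped later (on the bus, in a
    # controller, on the media) will no longer match.
    digest = hashlib.sha256(data).digest()
    f.write(data + digest)


def read_block(f, size):
    # Re-hash what actually came back and compare against the stored
    # checksum; a mismatch means corruption happened somewhere en route.
    data = f.read(size)
    stored = f.read(32)
    if hashlib.sha256(data).digest() != stored:
        raise IOError("checksum mismatch: data corrupted in transit or at rest")
    return data
```

Without redundancy this only detects the damage; with a mirror or parity, the same mismatch tells the file system which copy to discard and rewrite.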

The questions I'm sure Apple engineers asked were: how often do we see bit rot
occurring off media, and is the media we're deploying sufficiently resistant
to bit rot?

And, with APFS's flexible structure, this is a feature that can be added at a
later time. It probably made more sense to deliver something rock solid in
2017 that they could build on than to either (A) push out the delivery date or
(B) not fully bake all the features of the file system.

~~~
greglindahl
And here's a nice summary of the bitrot that you see in systems with disks.
Just as with network devices, anyone with enough gear sees these failures.

[https://www.usenix.org/legacy/event/fast08/tech/full_papers/...](https://www.usenix.org/legacy/event/fast08/tech/full_papers/krioukov/krioukov_html/main.html)

------
rayiner
Fun fact: Dominic Giampaolo (who wrote the BeOS file system) is on the APFS
team. His book "Practical File System Design" is an excellent description of a
traditional UNIX file system design. It may be out of print now, but I think
used copies turn up on Amazon.

~~~
gherkin0
> Fun fact: Dominic Giampaolo (who wrote the BeOS file system) is on the APFS
> team. His book "Practical File System Design" is an excellent description of
> a traditional UNIX file system design. It may be out of print now, but I
> think used copies turn up on Amazon.

It looks like he has a PDF up on his website:

[http://www.nobius.org/~dbg/practical-file-system-design.pdf](http://www.nobius.org/~dbg/practical-file-system-design.pdf)

~~~
greglindahl
He's had it up at that url since 2004, with a major update in 2008... says the
Wayback Machine.

------
throwaway2626
APFS is slated for a 2017 release, yet development started as recently as
2014. By comparison, development on Btrfs started all the way back in 2007,
yet many still consider it to be unsuitable for widespread deployment,
particularly in mission-critical settings.

If Apple can actually pull off this turnaround so quickly, does that suggest
the complaints about Apple's declining software engineering quality were
overblown?

Edit: Ted Ts'o in this talk [1] (at the 8-minute mark) discusses the task
force that birthed ext4 and Btrfs, and its estimate (based on Sun's experience
with ZFS and Digital's with AdvFS) of 5-7 years at a minimum for a new file
system to be enterprise-ready, an estimate which definitely proved optimistic
with regard to Btrfs. Will APFS be different?

1:
[https://www.youtube.com/watch?v=2mYDFr5T4tY](https://www.youtube.com/watch?v=2mYDFr5T4tY)

~~~
cyphar
[Disclosure: I work for SUSE]

> an estimate which definitely proved optimistic with regard to Btrfs.

SUSE has had enterprise Btrfs support since SLE11SP2 [1,2] (2012), and it's
been the default for the root partition since SLE12 [3] (2014). So the
estimate wasn't overly optimistic at all; it was actually very accurate. The
same support applies for openSUSE, though I think it's been there even longer.

[1] [https://www.linux.com/news/snapper-suses-ultimate-btrfs-snap...](https://www.linux.com/news/snapper-suses-ultimate-btrfs-snapshot-manager)

[2] [https://www.suse.com/communities/blog/introduction-system-ro...](https://www.suse.com/communities/blog/introduction-system-rollbacks-btrfs-and-snapper-sles-11-sp2/)

[3] [https://www.suse.com/documentation/sles-12/stor_admin/index....](https://www.suse.com/documentation/sles-12/stor_admin/index.html?page=/documentation/sles-12/stor_admin/data/sec_filesystems_major.html)

~~~
viraptor
Do you have any good source to point people at when they say btrfs is unstable
and corrupts data? (And I mean either positive or negative data; I don't want
to be biased either way.) I'm kind of tired of people posting comments like
that which are based on "Google is full of people saying this" or "I know one
person who lost data" (but with no idea which kernel it was; it could well be
from 2014).

If I look at btrfs patches on LKML, on one hand I can see some fixes for data
loss, but on the other they're usually close to "if you change the superblock
while the log is zeroed and the new superblock is already committed and
there's a power loss at exactly that point, you'll get corruption" - which are
just really obscure edge cases people are unlikely to ever hit.

So what can I look at to get a realistic picture of what's going on? (what
would SUSE point me at)

(For the negative results, I know of the recent filesystem-fuzzing
presentation where btrfs comes out worst, but honestly I don't consider it
interesting for real-world usage. Car analogy: I'm interested in how the car
behaves on a typical road, not in which fizzy drinks added to the gas tank
will break it.)

~~~
cmurf
RE: the fs fuzzing, David Sterba, a Btrfs developer at SUSE, replied to this:
[https://www.spinics.net/lists/linux-btrfs/msg54454.html](https://www.spinics.net/lists/linux-btrfs/msg54454.html)

As for "unstable and corrupts data", this is just not true. The many users
running it on stable hardware don't have problems. I've used Btrfs single,
raid0, raid10 and raid1 for years, and I've never had a corruption that I
didn't induce myself.

I have stumbled upon, just days ago, parity corruptions in raid5, however. The
raid56 stuff is much, much newer, hasn't been as well tested, and has been
regarded as definitely not ready for prime time. So that's a bug, and it'll
get fixed.

Bunches of problems happen on mdadm and LVM RAID, too, due to the drive's SCT
ERC setting being unset or unsupported, resulting in bad-sector recovery times
longer than the kernel will tolerate. That results in SATA link resets, rather
than the fs being informed which sector is bad so it can recover from a copy
and fix up the bad sector.

So there are bugs all over the place, it's just the way it goes and things are
getting quite a bit better.

It is totally true that Apple can produce their own hardware that doesn't do
things like lie to the fs about FUA (or its equivalent of req_flush) being
complete when it's not, or otherwise violate the order of writes the fs
expects in order to be crash tolerant. But we're kind of seeing Apple go back
to the old Mac days where you bought only Apple memory and drives; 3rd-party
stuff just didn't happen then. The memory is now soldered to the board, and it
looks like the next generation of NVMe and storage technologies may be the
same.

Windows and Linux will by necessity have file systems that are more general
purpose than Apple's.

------
e1ven
You might also find value in the comments from when this was published via
Adam Leventhal's blog -

[https://news.ycombinator.com/item?id=11934457](https://news.ycombinator.com/item?id=11934457)

------
louwrentius
What I found most interesting about the review is that Apple chose not to
implement file-data checksumming because, in their view, the underlying
hardware is very 'safe' and already employs ECC anyway.

~~~
brigade
Serious question: what value does filesystem checksumming offer for the
average user who has no redundancy?

I mean, it'll tell you that your only copy of a file got corrupted, but it'll
still be corrupted...

~~~
radiowave
The average user might have no redundancy, but they still ought to have a
backup. Checksum failure tells them they need to restore.

At the very least, a checksum failure might tell them (or the tech they're
consulting) that they have a data problem, rather than, say, an application
compatibility problem.

~~~
XorNot
"Why is my machine crashing?" "Well, somelib.so is reporting checksum
failures" is a much better experience then "weird, this machine used to be
great but now it crashes all the time"

~~~
kartickv
somelib.so what? And what's a "checksum"? Error messages need to be
comprehensible to the average user.

~~~
brokenmachine
"Error: Buy a new Mac."

~~~
kartickv
Assuming your intent is not to troll: "The file xyz.txt is corrupt. Click here
to restore from a Time Machine backup."

------
mietek
The author is apparently proud of the fact that they have “literally never
seen or heard of [the OS X document revision system] until researching this
post”. The dismissive tone of the entire article is hard to stomach. Ars has
really gone downhill since the departure of John Siracusa.

~~~
ebbv
John Siracusa's departure came after Ars had already gone downhill
significantly. Also, he was not an editor or regular contributor outside of
his OS X reviews, so his departure doesn't have much effect on the day-to-day.
It's more a symptom than a cause.

The cause is the site's editorial standards, which have been declining for
years, so every year is a new all-time low. Not to mention the really invasive
ads and even sponsored content.

~~~
acdha
People have been complaining about the editorial standards continuously since
at least 2000, but that doesn't magically lend this any more weight than any
other subjective assessment unsupported by evidence.

As for advertising, how do you expect them to hire writers and editors or run
a high-traffic website without advertising? They've offered paid subscriptions
for years and subscribers don't see ads at all but not enough people have
taken them up on it.

~~~
ebbv
Give me a break; I am not complaining about advertising, period. Ars has
always had advertising, and it used to be fine. But over time the ads have
gotten more and more intrusive. There are more and more autoplay ads with
sound, ads that take over part of the screen, etc.

The lack of subscriptions is probably a symptom of the editorial quality going
down. You can say there's no evidence of it, but when lots of people complain
about it, that _is_ evidence of it. I was a long-time Ars reader, but the b.s.
finally got so thick that I stopped going to the site regularly in the last
year.

It used to be a site for intelligent, balanced articles about tech. Now it's
got shills like DrPizza who basically just reprint whatever Microsoft's PR
department emails him.

------
Sanddancer
One of the (many) things that struck me in the article was his dismissive
remarks about copy on write within a filesystem. That seems like a
fantastically useful feature for development, etc., where you are almost
always building and then copying the artifacts into the deploy/test directory.
Avoiding the disk I/O in those situations seems like a pretty sound win to me
in terms of giving a performance boost for free.

~~~
erichocean
> _was his dismissive remarks regarding copy on write within a filesystem_

? ZFS is copy-on-write.

~~~
Sanddancer
Yeah, I know ZFS is CoW; I use snapshots all the time when building
jails/containers precisely for that reason. I was just struck by a few
paragraphs from the article:

> With APFS, if you copy a file within the same file system (or possibly the
> same container; more on this later), no data is actually duplicated.
> Instead, a constant amount of metadata is updated and the on-disk data is
> shared. Changes to either copy cause new space to be allocated (this is
> called "copy on write," or COW). btrfs also supports this and calls the
> feature "reflinks"—link by reference.

> I haven't seen this offered in other file systems (btrfs excepted), and it
> clearly makes for a good demo, but it got me wondering about the use case.
> Copying files between devices (e.g. to a USB stick for sharing) still takes
> time proportional to the amount of data copied of course. Why would I want
> to copy a file locally? The common case I could think of is the layman's
> version control: "thesis," "thesis-backup," "thesis-old," "thesis-saving
> because I'm making edits while drunk."

CoW is one of those features that is superficially questionable until you
start noticing the bits and pieces of workflow it really makes faster and
easier. Given that the keynote was aimed at an audience of developers, I'm
really surprised there wasn't a demo showing how much faster deploys, etc.,
are with such tech.

~~~
ysleepy
He is dismissive of file-level CoW. ZFS will also write new blocks if you cp a
file; only dedup remedies that. Also, APFS only seems to do file-level CoW via
a special syscall.
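For anyone wanting to try the btrfs variant mentioned in the article, GNU coreutils exposes reflinks through cp, and macOS's cp grew a similar flag for APFS clones (to the best of my understanding; the filesystem must support cloning):

```shell
# On Linux with btrfs or XFS: share the extents, and fail rather than
# silently fall back to a full copy if the fs can't do reflinks.
cp --reflink=always big.img big-clone.img

# On macOS with APFS: cp -c asks for a clonefile(2)-style clone.
cp -c big.img big-clone.img
```

In both cases the clone takes constant time and shares on-disk data until one copy is modified.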

~~~
sho_hn
Note he was also not even aware of support for the same feature in btrfs until
I pointed it out to him in an earlier HN discussion of his original blog post
-- the version on Ars already pretends otherwise, without being marked as a
fixup (though I am ready to blame this on Ars fully; his blog version includes
an UPDATE tag and has better flow).

------
forgotpwtomain
> APFS addresses this with I/O QoS (quality of service) to prioritize accesses
> that are immediately visible to the user over background activity that
> doesn't have the same time-constraints. This is inarguably a benefit to
> users and a sophisticated file system capability.

Could someone clear up how this can be determined at the filesystem rather
than the scheduler level (I suspect it cannot be, or the article is making
bogus claims)?

~~~
astrange
Why couldn't they be? Those are in the same process and can talk to each
other.

HFS+ implements scheduling of background QoS threads, especially on HDDs - you
can see it working with 'spindump'.
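A rough Linux analogue, for comparison (this is the block-layer I/O scheduler, not Apple's mechanism): the idle I/O scheduling class lets background work yield the disk to interactive accesses:

```shell
# Demo: create some data, then archive it at idle I/O priority so the
# archiver's reads and writes are only serviced when nothing else is
# competing for the disk.
workdir="$(mktemp -d)"
echo "some data" > "$workdir/file"
ionice -c 3 tar czf "$workdir/backup.tar.gz" -C "$workdir" file
```

Either way, the point is that prioritization needs cooperation between whoever issues the I/O and whoever schedules it.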

------
shmerl
Do they explain the reason why they needed a new filesystem instead of using
an existing one (say, OpenZFS)?

~~~
cm3
We can speculate on the management decision, but from an engineering point of
view, Dominic Giampaolo said that he didn't dive into btrfs, ZFS, or HAMMER in
order to avoid being badly influenced. At least that's how I read his answer
as cited by Adam. It's interesting that Larry Ellison (Oracle's CEO) being a
good friend of the late Steve Jobs didn't result in an acceptable ZFS
licensing agreement for Apple. I mean, they've incorporated DTrace (into the
kernel, of all things), so why not ZFS as well?

~~~
shmerl
_> from an engineering point of view Dominic Giampaolo said that he didn't
dive into btrfs, ZFS, or HAMMER in order to not get potentially bad
influence._

I'm not sure I quite understand this approach. It sounds like a pure NIH kind
of method (which is admittedly common for Apple). I.e., of course one can
always try to reinvent the wheel, but why is it bad to first analyze what
already exists and evaluate its good and bad sides? Or is his approach simply
always to make everything from scratch and not look at anything else?

~~~
desdiv
>why is it bad first to analyze what already exists and to evaluate good / bad
sides of that?

It's a legal defense strategy.

ZFS is covered by multiple patents.

If someone who has never read about any of ZFS's designs and patents
independently reproduces one of ZFS's patented features, then the courts could
rule that it was non-infringing.

That's why the author said of Giampaolo: "...but didn't delve too deeply for
fear, he said, of tainting himself". Reading too much about a patented product
effectively taints yourself from being able to freely create your own designs.

~~~
prutschman
Independent reinvention is not in general a defense to a claim of patent
infringement. Is there some fairly specific set of circumstances you're
referring to here?

~~~
antod
No, but it helps you avoid potential treble damages for willful infringement.

------
chiph
Because Apple systems are non-expandable, I usually store my large data on
external drives (USB, NAS, etc). Any idea if APFS will work on them, either
direct connect or iSCSI?

~~~
raverbashing
The file system is usually independent of the storage medium, with some
exceptions.

(And I would kind of avoid formatting external drives in "funny" FSs, as I
might want to read them in some other OSes. Unfortunately this usually means
FAT32.)

But I do have external media formatted as HFS to work with Time Machine.

~~~
laumars
For what it's worth, there are NTFS drivers for Linux and OS X (ntfs-3g). NTFS
often gets a mixed reception, but regardless of your opinion of it, it's still
massively better than FAT32.

Another option is ext3. The only caveat is that the Windows ext drivers don't
fully support ext3 (unless I've missed an announcement). However, they do
fully support ext2, and ext3 is backwards compatible, so you can effectively
get ext3 support in Windows.

Sadly though, all the good stuff requires 3rd party libraries. It's a real
pity everyone can't agree on a standard to replace FAT32. :(

~~~
ams6110
Is NTFS support good enough for writing? I haven't kept up. At one time you
could read NTFS on Linux well enough, but writing was not supported (or was
"experimental"), I think mostly because the permissions attributes were quite
different from the Unix approach, and also because it was reverse-engineered
and not officially endorsed by MS.

Did Microsoft ever open-source the NTFS specs and drivers?

~~~
takeda
This was true 15 years ago; I hope it is not true anymore.

I know there was a way to write to NTFS from Linux, but it required installing
an NTFS driver file from Windows.

I hope there is now native NTFS support that allows writing.

~~~
cyphar
> I know there was a way to write to ntfs from Linux, but it required to
> install ntfs driver file from Windows.

That isn't true. You need to use ntfs-3g, which is a free-software
implementation of NTFS (that allows both reading and writing). It's been
stable for 10 years. Using NTFS doesn't require anything from Windows and
doesn't require proprietary software.

~~~
takeda
It is true.

I stated how it was in the past [1]. Anyway, according to [1], it looks like
ntfs-3g still uses a proprietary version of ntfs.sys.

[1] [http://superuser.com/questions/139452/kernel-ntfs-driver-vs-...](http://superuser.com/questions/139452/kernel-ntfs-driver-vs-ntfs-3g/357949#357949)

~~~
cyphar
I just downloaded the source for ntfs-3g[1], and it doesn't appear to have any
binary blobs in it. In addition, it's under the GPLv2 so integrating
proprietary components is unlikely to be legal. The answer you linked quite
clearly says that the company _offers_ a proprietary _version_ of ntfs-3g. The
answer does _not_ say that ntfs-3g is proprietary.

Also, Trisquel (an FSF-approved GNU/Linux distribution, meaning that it
doesn't have any proprietary software within 100km of the distro) has packages
for ntfs-3g[2]. So it's _definitely_ entirely free software.

So again, you're wrong on this point. In addition, I strongly believe that you
were never correct on this point. Maybe you confused ntfs-3g with the
proprietary version that company sells?

[1]: [http://www.tuxera.com/community/open-source-ntfs-3g/](http://www.tuxera.com/community/open-source-ntfs-3g/)

[2]: [http://packages.trisquel.info/search?keywords=ntfs&searchon=...](http://packages.trisquel.info/search?keywords=ntfs&searchon=names&suite=belenos&section=all)

------
pohl
Great article!

 _Apple contrasts [space sharing] with the static allocation of disk space to
support multiple HFS+ instances, which seems both specious and an uncommon use
case._

Really? I depend on this use case every year to safely test drive pre-release
OS versions.

~~~
ahl
That use of multiple partitions, I think, qualifies as uncommon, which is not
to suggest that it's not useful.

~~~
stephencanon
Extremely commonly used by engineers within Apple, however =)

------
Etheryte
The original article was on HN's front page when it was first published.

------
4ad
> A ZFS developer’s analysis of the good and bad in Apple’s new APFS file
> system

What about linking to the actual source rather than a 3rd party:
[http://dtrace.org/blogs/ahl/2016/06/19/apfs-
part1/](http://dtrace.org/blogs/ahl/2016/06/19/apfs-part1/)

~~~
cm3
Well, the article on Ars seems to be authored by Adam, so it's probably a
cleaned-up single piece combining his blog posts.

~~~
lee_ars
Correct—I reached out to Adam to ask if Ars could syndicate the piece, and
then I did some minor cleanup and clarification editing on it (mostly style
conformance, but also some minor grammar tweaks and a few sentence re-writes).
After checking with him to make sure my changes didn't change anything
substantive, we ran the piece this AM.

Link wherever you'd like, of course, but the more traffic this pulls in, the
more ammo I have to be able to get Adam contributing to Ars as a regular
freelancer!

(edit - hi, adam!)

(edit^2 - corrections corrected. Apologies for the errors. I am just a simple
caveman. Your mathematics confuse and annoy me!)

~~~
cm3
If you want to publish more ZFS articles, I'd love to read about following
topics from Adam and/or members of OpenZFS:

- using DTrace for ZFS operations insight

- state of OpenZFS, comparing illumos, FreeBSD, NetBSD and Linux

- myths and/or often-cited problems (ECC, COW unfit for DB load, VM images,
etc.)

- ZFS version and feature flags wrt portability between illumos and FreeBSD
versions

- pool flexibility work that will make it easier to remove devices

- comparison to btrfs, hammer{1,2} and flash filesystems

- the topic of ZFS pointer rewrite

- garbage collection and safely removing traces of a file/directory in a COW
fs

- rebalancing story compared to HAMMER and future work in this space

- built-in ZFS encryption (independent of system crypto volume support)

I should say that I'd only support an article like that if Ars allows parts of
the written text to be incorporated into the OpenZFS wiki/documentation.

~~~
lee_ars
Some of these are probably a little too deep to get much traction on the Ars
front page, but there are some solid ideas here (especially the oft-cited-
problems one). Thanks for the feedback!

We _did_ run a big piece by Jim Salter a couple of years ago on next-gen file
systems that focused on ZFS and btrfs
([http://arstechnica.com/information-technology/2014/01/bitrot...](http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/)), but
yeah, I'd love to have more filesystem-level stuff showing up. The response is
generally very, very strong—turns out people really like reading about file
systems when the authors know what they're talking about!

edit -

> I should say that I'd only support an article like that if Ars allows parts
> of the written text to be incorporated into the OpenZFS wiki/documentation.

That's more complicated, unfortunately. I am not a lawyer etc etc and I am
only speaking generally here, but Ars and CN own the copyright on the pieces
we run (though syndications like Adam's piece today are different), and
wholesale reuse of the text without remuneration isn't something that the CN
rights management people like. Fair use is obviously fine, so quoting portions
of pieces as sources in documentation is not a problem, but re-using most or
all of something isn't (necessarily or usually) fair use.

(again, not a lawyer, my words aren't gospel, don't take my word for it, etc
etc)

~~~
cm3
I'm also not a lawyer, but my thought process is like this: in the open-source
spirit, given that this is not a book to be profited from (and profiting from
technical books is very hard anyway), the developers of some software could
contribute technical content, receive editor time instead of compensation, and
in return be allowed to include the content in the project's documentation.
Real World Haskell and Real World OCaml somehow managed to convince the
publisher this is fine. Again, IANAL, just thinking out loud.

