
The Sorry State of CoW File Systems - louwrentius
http://louwrentius.com/the-sorry-state-of-cow-file-systems.html
======
gnoway
TL;DR:

1. Author dislikes ZFS because you can't grow and shrink a pool however you
want.

2. Author likes BTRFS because it implements more flexible grow/shrink, but
finds that in other ways it's unstable, and is unhappy that BTRFS doesn't let
you have arbitrary numbers of parity drives.

~~~
bsder
> 1. Author dislikes ZFS because you can't grow and shrink a pool however you
> want.

I believe this was on the future timeline for ZFS. It required something like
the ability to rewrite metadata.

The problem is that nobody really cares about this outside of a very few
individual users. Anybody enterprise just buys more disks or systems. Anybody
actually living in the cloud has to deal with entire systems/disks/etc.
falling over so ZFS isn't sufficiently distributed/fault tolerant.

So, you have to be an individual user, using ZFS, in a multiple drive
configuration to care. That's a _really_ narrow subset of people, and the
developers probably give that feature the time they think it deserves (ie.
none).

~~~
BrainInAJar
vdev removal is in the canonical source of illumos at this point:
[http://blog.delphix.com/alex/2015/01/15/openzfs-device-
remov...](http://blog.delphix.com/alex/2015/01/15/openzfs-device-removal/)

~~~
mahrens
ZFS device removal is on its way but not quite in illumos yet. From Alex's
blog post: "We’ll publish the code when we ship our next release, probably in
March [2015], [and we will] integrate into Illumos once we address all the
future work issues."

------
jethro_tell
One thing to note here is that BTRFS use (or lack of use) is heavily based on
the way RAID1 is implemented. A BTRFS RAID1 doesn't do dual-disk striping; it
writes each file twice, on different drives. You can do a RAID1 with 3 drives,
or 4, or 5, giving the same redundancy as RAID5 under a different name. If you
want to double the parity, I believe you would put dual RAID1s into another
RAID1.

I think the functionality of BTRFS is there, though the ideas about how we
build redundant data sets will need to shift a bit.

~~~
louwrentius
I'm sorry, but I don't think I follow you. Mirroring can never achieve the
space/redundancy efficiency of parity RAID, so I don't see how the above works.
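The space arithmetic behind that point can be sketched quickly (the disk
counts below are illustrative, not from the article):

```python
# Usable fraction of raw capacity, ignoring filesystem overhead.

def mirror_usable(n_copies):
    # Mirroring: every block is stored n_copies times.
    return 1 / n_copies

def parity_usable(n_disks, n_parity):
    # Parity RAID: n_parity disks' worth of space goes to parity.
    return (n_disks - n_parity) / n_disks

# 6 disks as mirrored pairs vs. the same 6 disks in RAID-Z2:
print(mirror_usable(2))     # 50% usable, survives 1 failure per pair
print(parity_usable(6, 2))  # ~66.7% usable, survives any 2 failures
```

The gap widens with disk count: parity overhead shrinks as you add disks,
while mirroring stays fixed at 50% (or worse).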

~~~
hammerandtongs
As you said yourself, volume sizes aren't exactly the problem anymore, i.e.
there is tons of space to store bits these days.

Raid rebuilds take way too long for modern volume sizes.

Raid rebuilds take so long that the likelihood of losing a second drive before
completion is very high.
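A back-of-the-envelope model of why (drive size, rebuild speed, and the error
rate below are assumed, typical consumer-drive numbers, not figures from this
thread):

```python
drive_bytes = 8e12       # an 8 TB drive
rebuild_rate = 150e6     # ~150 MB/s sustained; optimistic on a busy array
rebuild_hours = drive_bytes / rebuild_rate / 3600   # ~14.8 hours minimum

# Probability of hitting at least one unrecoverable read error (URE)
# while reading the whole drive, at the commonly quoted consumer spec
# of 1 error per 1e14 bits read:
ure_rate = 1e-14
bits_read = drive_bytes * 8
p_ure = 1 - (1 - ure_rate) ** bits_read             # roughly 0.47

print(f"{rebuild_hours:.1f} h, P(URE) = {p_ure:.2f}")
```

And a parity rebuild reads every surviving drive in the group, so the exposure
multiplies with the number of drives.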

The safety you generally want is redundant copies of checksummed data and
metadata blocks.

Btrfs allows you to choose the number of copies of data or metadata
separately.

Talking about RAID in the context of btrfs and zfs generally seems to confuse
people as it brings in the old expectations and understandings they worked so
hard at figuring out for the RAID era.

~~~
louwrentius
Rebuilds on ZFS and BTRFS are quite reasonable because most of the time only
the data itself is rebuilt, not the entire drive as with old-fashioned
solutions.

Rebuild time depends on drive size, not array size.

Triple-parity as part of ZFS allows you to create even larger vdevs while
keeping the risks manageable. Interesting for low-performance archiving
solutions.

------
PaulHoule
When it comes to file systems, I like boring. What I don't like is systems
that have catastrophic unrecoverable wrecks because hypothetically a bit might
get flipped at the hardware level. Or that just grind to a halt when you run
out of disk space, or whatever.

High-end storage systems support multiple hard drive controllers because
expensive hard drive controllers burn up like matches. Funny, I've never had a
cheap hard drive controller fail, but I never throw out the box an expensive
hard drive controller came in, because the R.M.A. is just a matter of time.

~~~
rodgerd
> What I don't like is systems that have catastrophic unrecoverable wrecks
> because hypothetically a bit might get flipped at the hardware level.

Perhaps they're hypothetical for you; that's nice. Having seen files (not
filesystems, just files) wrecked by things like "power supplies that are just
faulty enough to trash data but not faulty enough to stop booting the system",
not to mention dodgy drives and controllers, I'm quite happy to have
checksumming filesystems available.

~~~
johnplaynton
> Perhaps they're hypothetical for you; that's nice.

Err... I think he's in agreement with you actually.

~~~
rodgerd
He seems to me to be suggesting that the added functionality of ZFS and btrfs
is a net reliability loss, and that the benefits don't make up for it.

~~~
PaulHoule
Here are a few points:

(1) From the perception of most people, mainstream filesystems such as ext4
and NTFS are pretty reliable. I've certainly had mainstream filesystems get
damaged but I've been able to repair them or copy data off without a lot of
trouble.

One reason mainstream filesystems are reliable is that they privilege
reliability over performance.

(2) In my experience, new file systems are dangerous. I've pretty frequently
experienced data corruption within a few days of trying a new file system.
I've frequently been on projects that tried the latest thing, like the Linux
filesystem that was written by a murderer, and after experiencing problems,
we've gone back to mainstream file systems.

(3) New file system advocates believe filesystems and disks are unreliable, so
they're willing to tolerate a higher level of failures.

(4) Most terrifying, look at all the discussion on this thread about options
you can choose that might mitigate this problem or that problem. Every
configuration choice is a decision you can make wrong, is a reason why your
system can wreck in the middle of the night. If you know your stuff or hire
somebody who knows his stuff, maybe you'll get good choices, otherwise you are
playing Russian Roulette with Vladimir Putin.

------
jmbwell
Everything that matters about vdev size and configuration is explained here,
complete with charts that indicate the actual overhead of different
arrangements: [http://blog.delphix.com/matt/2014/06/06/zfs-stripe-
width/](http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/)

~~~
louwrentius
That's a very good article on that topic indeed. But in some ways it's sad
that you even have to think about this with ZFS. It seems this is not an
issue, or much less of one, with BTRFS (or the magic unicorn file system that
still isn't here).

~~~
mahrens
As the author of that article, thank you. And, I agree it would be sad if you
had to think about exactly how many drives are in a RAID-Z group. The entire
point of my blog post is that you do _not_ have to think about it.

"TL;DR: Choose a RAID-Z stripe width based on your IOPS needs and the amount
of space you are willing to devote to parity information. Trying to optimize
your RAID-Z stripe width based on exact numbers is irrelevant in nearly all
cases."

------
mrb
There is a very simple way to have a redundant ZFS pool that is expandable one
disk at a time: use ditto blocks, see
[https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing...](https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape)

- start with 2 disks (not a mirror, but 2 independent vdevs)

- set copies=2 to enable ditto blocks, causing data blocks to be
automatically stored on 2 different disks (ZFS always tries to store ditto
blocks on different vdevs when at least 2 are available)

- when adding a 3rd disk, each data block will continue to be stored on 2
different disks, and you have the option to add a 4th disk, 5th disk, etc.,
any time

The overhead is 50% (like RAID1/mirroring), so I presume this can be a
downside for the hobbyist who usually cares about dollar per TB. But
nonetheless this is an option.
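The recipe above can be sketched with the standard zpool/zfs CLI (device
names here are hypothetical; also note that placing the second copy on a
different vdev is best-effort, not a guarantee, so this is not equivalent to
a real mirror):

```shell
# Two independent top-level vdevs, not a mirror:
zpool create tank /dev/sdb /dev/sdc

# Store every data block twice; ZFS tries to put the
# copies on different vdevs when it can:
zfs set copies=2 tank

# Later: grow the pool one disk at a time.
zpool add tank /dev/sdd
```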

I can see why hobbyists may want to expand a pool one disk at a time, but
personally I have been running ZFS for 10 years, I have a 20TB fileserver at
home (I grew it from 1TB over the years), and I have never needed to add just
one disk at a time. Usually when I run out of space, I replace all the disks
at once with larger ones (and/or replace the server if it is more than 4-6
years old).

Another point I wanted to comment on: the author makes the typical mistake of
assuming ZFS on a single disk is not very interesting ( _"A VDEV is either a
single disk (not so interesting)"_ ), but it is. To name a few features that
make it great to run even on a single-disk system: end-to-end checksumming,
self-healing, scrubbing, snapshots, clones, compression, ditto blocks,
deduplication, CoW, zfs send/recv, simple CLI tools, etc.

~~~
tw04
For the love of god, don't spread that nonsense. Ditto blocks do NOT protect
against a drive failure. If you lose a drive, you will lose all of your data.
They are intended as a belt-and-suspenders feature, put in place knowing that
as we get larger and larger drives with higher and higher error rates, RAID
alone will likely not guarantee your data is protected.

[http://zfs-discuss.opensolaris.narkive.com/1aGWdqth/adding-n...](http://zfs-
discuss.opensolaris.narkive.com/1aGWdqth/adding-new-disks-and-ditto-block-
behaviour)

------
cmurf
Btrfs raid7 (3 parities) could be useful, but more parities really don't
scale very well. Raid10 does scale. So does GlusterFS.

A bigger scalability problem with Btrfs is its raid10 implementation.
Conventional raid1+0 lets you lose more than one device at a time, as long as
you don't lose both halves of the same mirror pair. Btrfs raid10 doesn't have
consistent mirror pairings, so the mirrored chunks for device A1 aren't all on
B1; they are distributed across multiple other drives, thereby increasing the
chances that B1's chunks are lost in a 2+ device failure.
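A toy probability model of that difference, assuming exactly two simultaneous
device failures and uniformly spread chunk mirrors (an idealized sketch, not
Btrfs's actual allocator):

```python
import math

def p_loss_fixed_pairs(n_devices):
    # Conventional RAID1+0: two failures lose data only if they
    # hit both halves of the same mirror pair.
    pairs = n_devices // 2
    return pairs / math.comb(n_devices, 2)

# 8 devices, fixed pairs: only 4 of the 28 possible two-device
# failure combinations are fatal.
print(p_loss_fixed_pairs(8))   # ~0.143

# With distributed mirroring, each device's chunks are spread across
# many partners, so with enough chunks nearly every one of the 28
# failure combinations hits some chunk and its only mirror: p -> ~1.
```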

Developers plan n-way raid1 Very Soon Now™ and hopefully will get around to
better guarantees for raid10 so that it scales like raid10 should.

------
frik
I am still waiting for a database-based file system.

Some filesystems, like NTFS/ReFS, BeFS, and afaik ReiserFS4(?) and Btrfs(?),
could be extended. Microsoft extended the NTFS fs driver with the _Cairo_
project in the NT4/5 era:
[http://en.wikipedia.org/wiki/Cairo_(operating_system)](http://en.wikipedia.org/wiki/Cairo_\(operating_system\))

    
    
      All of the other functions NTFS exports for the EFS 
      driver begin with the prefix NtOfs, which presumably 
      stands for NT Object File System. One of the original 
      goals of the NT 5.0 project (code-named Cairo) was to 
      develop an object-oriented file system. Although NTFS in 
      Win2K probably hasn't reached the level of object 
      orientation that the Cairo planners had in mind, 
      Microsoft has extended NTFS in several significant ways 
      from its NT 4.0 implementation. One of those ways is 
      NTFS's support for encrypted files via the NtOfs 
      interfaces.

-- [http://windowsitpro.com/systems-management/inside-
encrypting...](http://windowsitpro.com/systems-management/inside-encrypting-
file-system-part-1) , more info: [https://books.google.at/books?id=5f3mlM-
JKUAC&lpg=PA117&ots=...](https://books.google.at/books?id=5f3mlM-
JKUAC&lpg=PA117&ots=uRVy3Jqu1x&dq=http%3A%2F%2Fmicrosoft.com%2Fntserver%2Fcairomb.htm&pg=PA117#v=onepage&q=http://microsoft.com/ntserver/cairomb.htm&f=false)

Later Microsoft moved the database to user mode with the WinFS project; the MS
SQL database was stored in a hidden directory on the NTFS filesystem:
[http://en.wikipedia.org/wiki/WinFS](http://en.wikipedia.org/wiki/WinFS).
WinFS failed because of its overcomplicated metadata ontology, and it was too
slow for a filesystem (.NET-based shell extension). Microsoft moved the ideas
to "SharePoint" (2007-2016), which now offers some of the proposed features:
[http://en.wikipedia.org/wiki/SharePoint](http://en.wikipedia.org/wiki/SharePoint)

BeFS includes support for extended file attributes (metadata), with indexing
and a query language like a relational database:
[http://en.wikipedia.org/wiki/Be_File_System](http://en.wikipedia.org/wiki/Be_File_System).
The BeFS book has all the details:
[http://www.nobius.org/~dbg/practical-file-system-design.pdf](http://www.nobius.org/~dbg/practical-file-system-design.pdf)

The IBM AS/400 and iSeries already had a database-based filesystem in the
1980s:
[http://en.wikipedia.org/wiki/File_system#Database_file_syste...](http://en.wikipedia.org/wiki/File_system#Database_file_systems)

~~~
SigmundA
What is the difference between a file system and a database? From my point of
view, a file system is just a specific kind of database (a hierarchical
key-value store).

Then we put other kinds of databases on top of the file system database.
Funny, isn't it?

If the file system were a complete enough database, we might not need things
like SQLite, which purports to be a replacement for fopen. Hmm, full circle.
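The fopen comparison can be made concrete with a toy sketch: a "file system"
as a single SQLite table mapping path to bytes (the paths and contents here
are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fs (path TEXT PRIMARY KEY, data BLOB)")

# "write a file"
con.execute("INSERT INTO fs VALUES (?, ?)", ("/etc/motd", b"hello"))

# "open and read a file"
(data,) = con.execute(
    "SELECT data FROM fs WHERE path = ?", ("/etc/motd",)
).fetchone()

# "list a directory" is just a range query on the key space
names = [p for (p,) in con.execute(
    "SELECT path FROM fs WHERE path LIKE '/etc/%'"
)]
```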

Of course we have barely figured out how to encode text, maybe one day we will
know how to store and retrieve data flexibly and consistently, one day long
from now.

I personally like the relational model best; it would be interesting to see
an OS and file system based on the relational model rather than the
hierarchical one, but what do I know.

~~~
noblethrasher
The nice thing about computers (Turing Machines) is that they let us simulate
anything, including having better[1] computers (that may just run a little
slower). The big idea behind turning the file system into a "database" is that
you would be able to simulate any kind of file system you wanted, including
better ones. This is basically what RDBMSs like SQL Server and SQLite do.

I also like the relational model the best, but it's just one of the many
available simulations of "betterness".

[1] Where "better" means: easier to program, or more reliable, or has infinite
memory (garbage collection), or is easier to use, or...

------
oconnore
What on earth are you doing with a 71 TB personal filesystem?

~~~
protomyth
It is amazing once you start working with video how much space you can use
with RAW formats and all the files you create. A Red EPIC in HDRx at 8:1 at 5K
is almost 6GB/minute.
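The arithmetic adds up quickly (the 10:1 shooting ratio below is an assumed
figure for illustration, not from the comment):

```python
gb_per_minute = 6                  # Red EPIC, HDRx, 8:1, 5K (per the comment)
gb_per_hour = gb_per_minute * 60   # 360 GB per hour of footage

# A 2-hour final cut shot at a 10:1 ratio means 20 hours of raw footage:
raw_tb = 2 * 10 * gb_per_hour / 1000
print(raw_tb)   # 7.2 TB, before any proxies, renders, or backups
```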

~~~
agumonkey
What amazes me is that not long ago these were Hollywood-class specs.

[http://en.wikipedia.org/wiki/Monsters_vs._Aliens](http://en.wikipedia.org/wiki/Monsters_vs._Aliens)
required 120TB of data.

~~~
superuser2
RED is Hollywood-class, though towards the lower end. A lot of big-name movies
that you've heard of include scenes shot on RED cameras:
[http://www.red.com/shot-on-red](http://www.red.com/shot-on-red). The TV show
Leverage was shot entirely on a pair of RED Xs.

RED's stuff is on the order of $25k. The main player for digital cinema is
Arri ALEXA, on the order of $50k. It's not that far off.

~~~
agumonkey
Yeah, I've read about big names switching to RED (that's what made their
business thrive). I was also referring to the disk space: a single person can
assemble a 100TB setup. That was Pixar territory not long ago.

------
KaiserPro
What's missing here is the documentation.

ZFS has wonderful docs and admin tools; BTRFS has terrible docs and nasty
tools that handle like mdadm.

------
legulere
The saddest thing about both file systems is their licences.

The CDDL prevents ZFS from being included in, or heavily integrated into,
Linux, the BSDs, OS X, and Windows. Btrfs will very likely stay Linux-only
because of the GPL.

The interesting and hard parts of a file system are pretty much
operating-system independent, so I wish for a common file system initiative.

~~~
justincormack
"The CDDL prevents ZFS from being included/heavily integrated into Linux, the
BSDs, OS X and Windows" \- no, it is included and integrated into FreeBSD with
no issues, NetBSD has an old version that needs updating, OpenBSD thinks it is
too complex. OSX could use it (they could get a commercial license anyway),
but seemed to decide not to, as could Windows.

~~~
jethro_tell
I think the actual difficulty is patents. From what I remember, once Oracle
bought Sun, they decided to go after NetApp for infringing on ZFS. As it
turned out, Oracle was infringing on NetApp's patent, but NetApp had let it
slide because, until Oracle started being a dick, it was mutually beneficial.
That is why Oracle also started BTRFS and is putting its development behind
that instead.

I'm not really sure where that leaves other commercial projects (especially
OS X, which is badly in need of a functional file system).

~~~
gnoway
It was really the other way around. NetApp had Sun in court over ZFS
infringing on WAFL patents[1] well before the buyout.

I certainly don't know all the ways people use their Macs, but it seems like
ZFS would not get a lot of use in OS X unless it was the only option:

- It doesn't offer a whole lot over traditional filesystems in a single-device
context; since Apple has basically abandoned server/enterprise, I would wager
the vast majority of new OS X systems are single-drive.

- At the time Apple was supposedly making a decision about this, I am not sure
if ZFS handled 4k alignment (ashift); this is very important now that a lot of
new Macs are shipping with SSDs.

- You're still, even w/ 10.10, discouraged from using case-sensitive HFS+; at
least one application[2] won't work on a case-sensitive filesystem on OS X.

The above reasons probably apply to Windows as well. Also, in general, it
seems like both vendors want to be the sole source for all of their core
features.

[1] [http://www.zdnet.com/article/netapp-claims-suns-zfs-
violates...](http://www.zdnet.com/article/netapp-claims-suns-zfs-violates-its-
patents/)

[2]
[https://support.steampowered.com/kb_article.php?ref=8601-RYP...](https://support.steampowered.com/kb_article.php?ref=8601-RYPX-5789)

~~~
jethro_tell
From link [1] above, it looks like Sun went after NetApp; after NetApp poked
around, it turned out that Sun was the one in violation. I remembered that was
the case, but I was thinking that Oracle had already bought them at that
point.

FTA:

>Sun approached NetApp about 18 months ago with claims the storage maker was
violating its patents and seeking a licensing agreement, NetApp Chief
Executive Dan Warmenhoven said in a statement.

>Several months into those discussions and following a review of the matter,
NetApp made a discovery of its own, Warmenhoven said, concluding NetApp did
not infringe the patents but that Sun infringed on NetApp's.

~~~
bcantrill
That's not at all what happened. (Disclaimer: I was at Sun at the time and was
deposed in the case.) Sun didn't "go after NetApp" -- NetApp tried to buy
some StorageTek patents via a third-party intermediary, and when they were
rebuffed, they came after ZFS.[1] And, it should be said, NetApp didn't
particularly care about Sun -- they cared about the fact that ZFS was open
source. NetApp wanted Sun to "close" ZFS or otherwise "restrict its use"[2].
As for the case itself, it was moved back to California (NetApp had initiated
it in East Texas, the patent troll capital of the universe) where it became a
massive case, and was then slimmed down by order of the magistrate to three
patents on the NetApp side and four patents on the Sun countersuit side. At
the same time, thanks to a community outpouring of prior art, Sun was able to
pursue invalidating the claims of the NetApp patents with the US Patent
Office.[3] These efforts were wildly successful, and all three NetApp patents
were rejected on all claims. Amazingly, the case wasn't thrown out at that
point (though any damages would obviously be very limited), but every turn in
the case had gone Sun's way.

Then, Oracle acquired Sun, and for reasons that haven't been disclosed, Oracle
and NetApp dismissed their respective cases.[4] While I can't disclose the
reasons behind this, I can say that both Oracle and NetApp would have jumped
at the chance to cross-license ZFS and WAFL patents in a way that extended
only to Oracle and not to CDDL licensees. (That is, prohibited open source
ZFS.) Because the CDDL is airtight with respect to patents, such cross-
licensing was impossible, and by dismissing their suits (instead of settling),
the findings of fact from the trial essentially disappear -- which is
enormously to NetApp's advantage. Point is: ZFS actually has about as much
patent security as one can find in an open source system, as it has withstood
a direct, full-frontal assault by attorneys seeking to find a way around its
patent grants.

[1]
[https://web.archive.org/web/20070423001711/http://blogs.sun....](https://web.archive.org/web/20070423001711/http://blogs.sun.com/jonathan/entry/on_patent_trolling)

[2]
[https://web.archive.org/web/20080625023043/http://blogs.sun....](https://web.archive.org/web/20080625023043/http://blogs.sun.com/jonathan/entry/harvesting_from_a_troll)

[3]
[http://www.theregister.co.uk/2008/10/07/sun_gets_netapp_pate...](http://www.theregister.co.uk/2008/10/07/sun_gets_netapp_patent_invalidated/)

[4]
[http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_di...](http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/)

~~~
jethro_tell
Oh wow, thanks for the context and links.

