
No, ZFS really doesn't need an fsck tool - antonios
http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html
======
deelowe
The whole fsck discussion seems baffling to me.

While I'm no ZFS expert, I've been using it for several years now, and my
understanding is this: take what a normal fsck-type tool does and build those
features into the underlying FS and supporting toolchain. For what ZFS does
and how it works, it really doesn't make sense to me _at all_ for it to have
an "fsck," whatever that would mean. Really, it's hard to even imagine what an
"fsck" would do for ZFS. You'd just end up rewriting bits of the toolchain or
asking for the impossible.

I asked this in the other thread, but I'll ask here again. Excluding
semantics, what is it that people want fsck to do specifically that zfs
doesn't provide a method for already? Seriously, the question to me seems akin
to asking why manufacturers don't publish the rpm spec for SSDs. It's a really
odd thing to ask and can't be answered without an exhaustive review of the
mechanics of the system.

I can't help but get the feeling that a lot of people complaining about ZFS
have very little knowledge or familiarity with it and/or BSD/Unix in general.
ZFS is not like any Linux FS. It doesn't use fstab, the toolchain is totally
different, the FS is fundamentally different. It was built for Solaris and
really reflects their ideology, which is completely foreign to people who only
have familiarity with Linux. Accept it and move on or don't, but I've yet to
see any evidence to back up these claims other than "this is what is done in
Linux for everything else" which is just FUD.

~~~
bcantrill
Yes, emphatically agreed. I co-founded Fishworks, the group within Sun that
shipped a ZFS-based appliance. The product was (and, I add with mixed emotion,
still is) very commercially successful, and we shipped hundreds of thousands
of spindles running ZFS in production, enterprise environments. And today I'm
at Joyent, where we run ZFS in production on tens of thousands of spindles and
support software customers running many tens of thousands more. Across all of
that experience -- which has (naturally) included plenty of pain -- I have
never needed or wanted anything resembling a traditional "fsck" for ZFS. Those
that decry ZFS's lack of a fsck simply don't understand (1) the semantics of
ZFS and specifically of pool import, (2) the ability to rollback transactions
and/or (3) the presence of zdb (which we've taken pains to document in
illumos[1], the repository of record for ZFS). So please, take it from someone
with a decade of production experience with ZFS: it does not need fsck.

[1] <http://illumos.org/man/1m/zdb>
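For anyone who hasn't seen those mechanisms, here is a minimal sketch of what
they look like from the shell -- the pool name `tank` is a placeholder, and
you should read the zpool/zdb man pages before attempting recovery on a real
pool:

    # (1)/(2) Recovery-mode import: discard the last few transaction
    # groups to return an unopenable pool to an importable state.
    # With -n, only report what would be done (a dry run).
    zpool import -F -n tank
    zpool import -F tank

    # (3) zdb, the ZFS debugger: inspect on-disk state read-only,
    # e.g. the cached pool configuration and the active uberblock.
    zdb -C tank
    zdb -u tank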

~~~
rsync
Hi. rsync.net here. I wonder if you concur with us that ZFS _does_ need a
defrag utility?

I know you can do (and we have done) the copy-back-and-forth method (sketched
below) to rebalance how things are organized across multiple vdevs inside one
pool, but it would be nice if that were not a manual process, or if it could
be optional on a scrub.

Or something.

People can and do run their pools up higher than 80% utilization. It happens.
It's happened to you. There should be a non-surgical way to regain balanced
vdevs after such a state...
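(For anyone unfamiliar with the workaround rsync.net describes, it is roughly
the following sketch -- `tank/data` is a placeholder, and the dataset should
be idle during the swap:)

    # Replicate the dataset within the pool; the copy's writes are
    # spread across all vdevs, including any newly added ones.
    zfs snapshot tank/data@rebalance
    zfs send tank/data@rebalance | zfs recv tank/data.new

    # Once the copy is verified, swap it into place.
    zfs destroy -r tank/data
    zfs rename tank/data.new tank/data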

~~~
bcantrill
Ah, yes -- now we're talking about a meaningful way in which ZFS can be
improved! (But then, you know that.) Metaslab fragmentation is a very real
issue -- and (as you point out) when pools are driven up in terms of
utilization, that fragmentation can become acute. (Uday Vallamsetty at Delphix
has an excellent blog entry that explores this visually and
quantitatively.[1]) In terms of fixing it: ZFS co-inventor Matt Ahrens did
extensive prototyping work on block pointer rewrite, but the results were
mixed[2] -- and it was a casualty of the Oracle acquisition regardless. I
don't know if the answer is rewriting blocks or behaving better when the
system has become fragmented (or both), but this is surely a domain in which
ZFS can be improved. I would encourage anyone interested in taking a swing at
this problem to engage with Uday and others in the community -- and to attend
the next ZFS Day[3].

[1] <http://blog.delphix.com/uday/2013/02/19/78/>

[2] <http://permalink.gmane.org/gmane.os.illumos.devel/5203>

[3] <http://zfsday.com/zfsday/>

~~~
rsync
We'll see what we can do. We have some ZFS firefighting that dominates our to-
do list currently[1][2] but if we can work through that in short order we will
dedicate some funds and resources to the FreeBSD side of ZFS and getting
"defrag" in place.

I meant to attend ZFS Day and will try to come in 2013 if it is held.

[1] Space accounting. How much _uncompressed_ space, minus ZFS metadata, does
that ZFS filesystem actually take up? Nobody knows. (See the sketch below.)

[2] extattrs + busy ZFS == crash
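(On footnote [1]: the closest approximations I know of are the standard
dataset properties, none of which is exactly the number asked for --
`tank/fs` is a placeholder:)

    # 'used' is post-compression and includes metadata overhead;
    # 'logicalused', on implementations that have it, approximates
    # the uncompressed size of the data.
    zfs get used,logicalused,compressratio tank/fs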

------
ChuckMcM
Interesting rant (from 2009). At NetApp, the WAFL file system is also always
consistent on disk, so it too doesn't need fsck. That said, WAFL had 'wack'
(WAfl ChecK), which could go through and check that the on-disk image was
correct.

Unlike UFS or FFS or EXTn, the file system couldn't be corrupted by loss of
power mid-write, but like ZFS it could be corrupted by bugs in the code which
_write a corrupted version to disk_. So the tool does something similar to
fsck, but it is simpler: more of a data-structure check than a "recreate the
flow of buffers through the buffer cache to save as much as possible"
exercise.

~~~
gnosis
_"At NetApp the WAFL file system is also always consistent on disk, so it too
doesn't need fsck."_

How does it manage to stay consistent if a cosmic ray strikes it and flips one
or more bits?

How does it manage to stay consistent if you physically bump in to the drives
and cause physical damage by having the disk head briefly touch the disk
surface?

Wouldn't you need a filesystem consistency check and repair tool like fsck in
these cases?

~~~
ChuckMcM
_"How does it manage to stay consistent if a cosmic ray strikes it and flips
one or more bits?"_

At the time (and I think it's still true) cosmic rays do not have sufficient
energy to flip a magnetic domain on disk. Memory bit flips are detected by
ECC, and channel errors (between the I/O card and memory and/or disk) are
identified with CRC codes.

 _"How does it manage to stay consistent if you physically bump in to the
drives and cause physical damage by having the disk head briefly touch the
disk surface?"_

The disks are part of a RAID 4 or RAID 6 group (RAID 6 preferred for drives >
500GB, required for drives >= 2TB), so physically damaging a drive results in
a group reconstruction of the data on that drive.

NetApp has always had a pretty solid "don't trust anything" sort of mantra
that has been tested and fortified a few times by various events. The ones I
got to see first hand were an HBA that corrupted traffic through it in flight,
drives that returned a different block than you asked for, and drives that
acknowledged they had written data to the drive when in fact they had not.

Back in the early 2000s, anything that could happen to a disk with a
probability of once in a billion operations or higher, they got to see once a
month. It was an interesting challenge which required a certain discipline to
deal with. When I went to Google and saw their "we assume everything is crap,
we just fix it in software" model, it gave me another perspective on how to
tackle the problem of storage reliability.

Both schemes work and have their plusses and minuses.

------
c0t0d0s0
1. When there is a bug in the code that writes the ZFS on-disk state, why
should that bug be addressed by fsck code? That would assume you know about
the bug beforehand, in which case you would do better to fix it in the code
that writes.

2. When there is a bug in the on-disk state, it should be addressed by the
code that reads the data, not by an fsck tool.

2.1. Correcting a bug in the on-disk state should be done on the basis of
exact knowledge of that bug, not by a generic check tool.

3. Repair is always based on assumptions, which may be correct or incorrect.
The more you know about the problem that led to the repair-worthy state, the
more probable it is that the assumptions are correct.

4. What is the reasoning behind the assumption that when your metadata is
corrupt, the data is still correct, so that you can repair the metadata
corruption without problems? It sounds more sensible to fall back to the last
known correct and consistent state of metadata and data: the on-disk state
represented by the pointer structure of the uberblock with the highest
transaction group commit number that still has a correct checksum. The
transaction group rollback at import does exactly this.
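A sketch of how that rollback surfaces in the tools -- pool, device, and txg
number are placeholders, and -T is a last-resort option that is hidden or
undocumented on some platforms:

    # Show the uberblocks in a device's labels, with their
    # transaction group numbers and checksums.
    zdb -ul /dev/da0p3

    # Normal recovery: rewind to the most recent consistent txg.
    zpool import -F tank

    # Last resort: import read-only at an explicitly chosen txg.
    zpool import -o readonly=on -T 1234567 tank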

------
ScottBurson
I lost a ZFS pool once. The cause ultimately turned out to be a slowly failing
PSU. (It was an expensive OCZ PSU, too, which is why I didn't suspect it as
quickly as I probably should have. OCZ did replace it under warranty without
argument.)

It was a development machine, so it wasn't being backed up. I thought it was
just one disk going bad; by the time it was clear that it was something worse
than that, it was too late. Most of the important contents of the pool had
been checked into the VCS, but not everything. I wound up grepping the raw
disk devices to find the latest versions of a couple of files.

Any filesystem would have had serious trouble in such a situation, of course.
But I can't help thinking that picking up the pieces might have been easier
with, say, EXT3.

On the other hand, I think it speaks well for ZFS that a slowly failing PSU
seems to be almost the only way to lose a pool.

------
ianlevesque
So if you have an unmountable ZFS pool, instead of reaching for fsck (which
doesn't and won't exist) you can do:

    zpool clear -F data

And it will take advantage of the copy-on-write nature of ZFS to roll back to
the last verifiably consistent state of your filesystem. That's a lot better
than the uncertain fixes applied by fsck to other file systems. It even tells
you how old the restored state is.

~~~
cbr
Why not ship with fsck.zfs="zpool clear -F data"? Then people would stop
complaining.
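Something like that would be simple enough; a hypothetical fsck.zfs along
those lines might be:

    #!/bin/sh
    # Hypothetical fsck.zfs wrapper: hand the pool name from the
    # init scripts straight to ZFS's own recovery mode.
    exec zpool clear -F "$1"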

------
nwf
A slight disagreement: the advantage of a ZFS online consistency checker would
be to help ensure that there are no bugs in ZFS.

It appears that ZFS lacks a full consistency checker -- scrub only walks the
tree and computes checksums; notably absent in this procedure appears to be
validating the DDT. While ZFS claims to be always on-disk consistent--and I
certainly believe that the intent is that it be so!--I seem to have tripped
over some bug ( [http://lists.freebsd.org/pipermail/freebsd-
fs/2013-March/016...](http://lists.freebsd.org/pipermail/freebsd-
fs/2013-March/016627.html) ) which corrupted the DDT, and now I have no way of
rebuilding it, so I dropped $$$ (for me) on a new disk array and zfs send |
zfs recv so that everything rebuilt. That's sort of crazy, if I may be so
bold.

I suppose I could take the pool offline for several days and poke at it with
zdb, but that is not really desirable either.
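(For what it's worth, zdb can at least report on the DDT offline, though
inspection is not repair -- `tank` is a placeholder:)

    # Dedup-table statistics; repeat -D for progressively more detail.
    zdb -DD tank

    # Traverse the pool and verify metadata block checksums.
    zdb -c tank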

------
joosters
The article doesn't ever consider that _ZFS_ might have bugs. Dodgy disks, bad
firmware, power failures, yes. But no consideration that the ZFS code could
contain problems.

If you are happy that the ZFS code is perfect, then it makes sense to rely
upon its consistency checks, snapshot features, etc (and I'm not criticising
those). But what if ZFS isn't 100%? How do you recover your data?

~~~
4ad
_Any filesystem_ might have bugs and you can't rely on fsck to alleviate that
problem (in fact it might make it worse).

That problem is solved by having backups. ZFS is not a replacement for
backups.

Oh, and there's always zdb, the ZFS debugger; you can use it to walk the on-
disk structures.

~~~
tjoff
The fsck can't make it worse if you mirror the drive first.

Also, is "you can't rely on fsck to alleviate that problem" an argument for
not having an fsck?

------
abtinf
ZFS may not need fsck, but it would be great if Oracle would re-open-source
it. I've considered using it, but I can't trust that it has a future.

I'm also rather confused by Oracle contributing to btrfs while also building
ZFS privately. My intuition is that if they open-sourced ZFS and offered it
under a dual BSD/GPL license, it would become the fs standard overnight.

~~~
antonios
It has a future. FreeBSD, and companies based on it like iXsystems, have
adopted it, and developers are actively hacking on it. FreeBSD 10 will
feature TRIM support as well as a new data compression scheme (LZ4):
<https://wiki.freebsd.org/WhatsNew/FreeBSD10>

------
jodrellblank
And after a dozen paragraphs on how ZFS is unlikely to get corrupted, the meat
of the content: "my opinion is that you shouldn't try to repair it anyway".

 _Anyway: You do not repair the state last state of the data. And in my
opinion: You should not try to repair it ... at least not by automatic means.
Such a repair would be risky in any case. [..] In this situation i would just
take my money from the table and call it a day. You may lose the last few
changes, but your tapes are older._

This "you do not need an emergency repair tool because in an emergency I think
you should just forget it" is exactly the claim that this blog post was
supposed to be countering. Explaining why a do-the-best-you-can repair utility
is not necessary, and the argument it boils down to is "because I don't think
you should do that".

~~~
c0t0d0s0
The basic problem of filesystem repair is that you repair the metadata, but
the repair can do nothing about your data. So when your filesystem enables
you, via copy-on-write, to fall back to a known consistent state of metadata
and data, you should fall back to that known consistent state, and not to
something patched up by a generic tool. How would you know, in such a
situation, whether the repair got something wrong somewhere among your
thousands and thousands of files? The copy-on-write behaviour of ZFS has its
advantages.

And as I already wrote in a different comment: if there is a bug in the code
writing the on-disk state, the bug should be addressed with exact knowledge
of that bug, in the code reading the on-disk state. That code doesn't have to
make assumptions about what might be halfway correct; it can do the correct
thing with the known-incorrect on-disk state.

~~~
jodrellblank
Yes, I know you said both those two things, and I agree - ZFS has inbuilt
error detection and healing. Any code that can detect errors and heal them
should be there. And if you have corruption the only long term, safe, ass-
covering advice to give is to restore to a pre-corrupt state.

But the argument went:

Detractor: ZFS needs fsck.

You: No it doesn't.

Detractor: The ZFS creators' attitude has always been "we don't think it
should exist", but there's no more reason behind it than that. It still
corrupts, so it still needs an fsck tool.

You: Here is a big blog post about why it doesn't: OK so it can get corrupted
but I don't think an fsck tool should exist.

You know how useful it is to post on StackExchange "help I have this
situation, I know conceptually there is a way out, but how can I actually do
it?" and get the replies "you shouldn't want to do that"? It's not helpful at
all.

~~~
c0t0d0s0
When I leverage the copy-on-write-ness (which is really more redirect-on-
write) of the filesystem to recover from a defunct on-disk state to a known
state, a filesystem check is just a suboptimal solution. That is what I
wanted to express with the article. Of course, most filesystems are not COW
and can't use this, and thus the notion that a filesystem check is needed
prevails. But in the end a filesystem check is just about forcing a
filesystem into a state of metadata correctness, without caring much about
the data. I wouldn't count that as a way out when there are better ways.

I think the situation is pretty similar to the "shoot the messenger" problem
of ZFS. Some people are annoyed that ZFS reports errors caused by corruption
and blocks access to the data (in pools without any redundancy, of course).
However, the alternative would be reading incorrect data. What's worse:
knowing that you have to recover data, or processing incorrect data without
knowing it?

------
jiggy2011
This reminds me of when I introduce people to Linux and they insist that there
must _really_ be a C drive.

------
blinkingled
Been running it since 0.6.0-rc14 on a ProLiant MicroServer with ECC RAM and I
am happy with it. 4x2TB RAIDZ internal, and 2x1TB USB3 zpools, with SSD for
ZIL and logs. Shared over GigE using Samba4 and AFP.

Performance is decent enough with lz4 compression and dedup off. Dedup on
takes more CPU, but nothing even the 2.2GHz Turion can't handle. The main
thing is stability has improved a lot too.

If you want the utmost performance maybe this isn't for you, but for
NAS/backup/streaming type usage ZFS on Linux is nearly perfect.

------
spikels
Maybe I'm just tired this morning but I wish this article could just get to
the point. I feel like I'm reading a mystery novel but I am never going to
make it to the end and find out who did it.

~~~
4ad
This article explains _in-depth_ why ZFS doesn't need a fsck tool. It's very
much to the point in its entirety. I'm sorry content-free articles posted here
lately have diluted your expectations.

~~~
spikels
I agree this article is not content-free, but all these points could be made
in far fewer and less convoluted words. And they buried the real news (at
least to me): that the hardware is lying to the OS. Did not mean to offend.

~~~
andrewflnr
It's in the first sentence of the fourth paragraph.

------
DiabloD3
I don't understand why people think ZFS doesn't have a fsck tool: zpool scrub
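For anyone following along, the scrub workflow is just the following (pool
name is a placeholder):

    zpool scrub tank         # re-read every block, verify checksums
    zpool status -v tank     # progress, plus any errors found so far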

~~~
Karunamon
Probably because on at least two distros, fsck.zfs is just a symlink to a bash
script that runs "exit 0".

That kind of threw me for a loop.
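(For reference, the stub in question amounts to nothing more than this:)

    #!/bin/sh
    # fsck.zfs stub as shipped by some distros: always report success.
    exit 0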

~~~
DiabloD3
That's to deal with the fact that their init scripts blindly call
"fsck.fstype", and ZFS doesn't have an fsck, nor does it need one.

------
brianlouisw
love the domain name

------
derleth
It reminds me of how people used to think all filesystems needed to be
explicitly defragmented because of design flaws in FAT, which was designed for
floppies (and wasn't especially well-designed even at that).

<http://geekblog.oneandoneis2.org/index.php/2006/08/17/why_doesn_t_linux_need_defragmenting>

<http://www.howtogeek.com/115229/htg-explains-why-linux-doesnt-need-defragmenting/>

------
PaulHoule
Nope, it just needs to stop having wrecks.

