While I'm no ZFS expert, I've been using it for several years now and my understanding is this. Take what a normal fsck type tool does and build those features into the underlying FS and supporting toolchain. For what ZFS does and how it works, it really doesn't make sense to me at all for it to have an "fsck," whatever that means. Really, it's hard to even imagine what an "fsck" would do for zfs. You'd just end up rewriting bits of the toolchain or asking for the impossible.
I asked this in the other thread, but I'll ask here again. Excluding semantics, what is it that people want fsck to do specifically that zfs doesn't provide a method for already? Seriously, the question to me seems akin to asking why manufacturers don't publish the rpm spec for SSDs. It's a really odd thing to ask and can't be answered without an exhaustive review of the mechanics of the system.
I can't help but get the feeling that a lot of people complaining about ZFS have very little knowledge or familiarity with it and/or BSD/Unix in general. ZFS is not like any Linux FS. It doesn't use fstab, the toolchain is totally different, the FS is fundamentally different. It was built for Solaris and really reflects their ideology, which is completely foreign to people who only have familiarity with Linux. Accept it and move on or don't, but I've yet to see any evidence to back up these claims other than "this is what is done in Linux for everything else" which is just FUD.
I know you can use the copy-back-and-forth method (and we have) to rebalance how things are organized across multiple vdevs inside one pool, but it would be nice if that were not a manual process, or could be an option on a scrub.
People can and do run their pools up higher than 80% utilization. It happens. It's happened to you. There should be a non-surgical way to regain balanced vdevs after such a state...
I meant to attend zfs day and will try to come in 2013 if it is held.
 Space accounting. How much uncompressed space, minus ZFS metadata, does that ZFS filesystem actually take up? Nobody knows.
 extattrs + busy ZFS == crash
I know it's a bit off topic but could you elaborate on the issues you see with BTRFS?
- Online scrubbing rather than fsck, so there is little/no downtime if the filesystem were interrupted, i.e., datacenter power failure. fsck/raid rebuild of large filesystems can mean lengthy outages for users.
- "Always" consistent: Start writing the data of a transaction to unallocated space (or ZIL) and update metadata last.
- Greatly configurable block device layer:
* RAIDZ, RAIDZ2, RAIDZ3, mirror, concat ...
* ZIL (fs journal), L2ARC (cache) can be placed on different media or even combinations of media.
- Send & receive snapshots across the network.
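For anyone who wants the "update metadata last" point spelled out, here is a minimal Python sketch. It is a toy model, not ZFS's actual on-disk logic: `CowStore`, `commit` and `crash_before_root_update` are names made up for illustration. New data and new metadata always go to fresh blocks, and the root pointer is switched only as the final step, so an interrupted write leaves the previous state intact.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}    # block id -> contents (append-only)
        self.next_id = 0
        self.root = None    # id of the block holding the current file table

    def _alloc(self, contents):
        # Always write into previously unallocated space.
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = contents
        return block_id

    def commit(self, files, crash_before_root_update=False):
        # 1. Write the new data blocks.
        table = {name: self._alloc(data) for name, data in files.items()}
        # 2. Write the new metadata block that points at them.
        table_id = self._alloc(table)
        if crash_before_root_update:
            return  # simulated power loss: the old root is still in place
        # 3. Only now switch the root pointer to the new tree.
        self.root = table_id

    def read(self):
        if self.root is None:
            return {}
        table = self.blocks[self.root]
        return {name: self.blocks[block_id] for name, block_id in table.items()}


store = CowStore()
store.commit({"a.txt": b"version 1"})
store.commit({"a.txt": b"version 2"}, crash_before_root_update=True)
after_crash = store.read()   # still the old, consistent state
store.commit({"a.txt": b"version 3"})
after_commit = store.read()
```

The interrupted commit leaves orphaned blocks behind, but nothing reachable from the root ever refers to half-written data.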
From what I understand ZFS will never be supported for RHEL (and that's mostly what I work with) so I'm hoping for the best with BTRFS.
So I guess "alias fsck='zpool scrub tank'" would quell everyone's concerns.
Hence if anyone asks you whether ZFS has fsck it's easier to say yes it does and it's called scrubbing.
As far as the Linux/Unix thing, I think you're reading way too far into it. Linux neither invented nor popularized filesystem consistency checking programs. Unix filesystems (such as UFS) often have consistency checking programs as well, though they might not be called "fsck". Windows has Scandisk. ZFS is the odd one out here, and it's not surprising for people to treat it as such. Give it time, and if ZFS's approach becomes more widespread, people will come around.
tl;dr: It's about trust, not ignorance.
I brought up UFS as people who just use Linux often similarly complain about the way BSD does "partitioning." The two debates seem very similar to me.
In my case, our setup died due to some power problems, and somehow, NULL pointers got written to disk with valid checksums. Normally this wouldn't happen, but when it does, it's a PITA to debug because trying to traverse/read the disk gives a kernel fault, instead of a segfault as you might see in a user-level fsck program. This was a real pain, as we had encrypted disks, and every reboot meant going through the disk attach steps (enter password, etc etc), every time.
As a result, a userspace implementation of scrubbing would be useful since in this sort of rare instance, I'd be able to probe the fsck process with a good debugger and see why it's crashing. Since it's in userspace, the fsck program can also quit more sanely, with a full report on where it found the corruption. I was able to get my data back via some ad-hoc patches, but it was an... interesting experience having to debug in kernel vs in userspace.
zdb isn't a substitute for most of these things, as the on-disk compression and RAIDZ sharding make it difficult to actually see the raw data structures. Max Bruning wrote a post a while back with an on-disk data walkthrough, where he wrote some patches to fix this, but they haven't made their way upstream yet. Additionally, FreeBSD and Linux don't have mdb. :(
zdb now does decompression, though slightly differently from what I implemented.
Syntax is: zdb -R poolname vdev:offset:size:d
The "d" at the end says to decompress. zdb tries different decompression algorithms until it finds one that
As for my mdb changes, I really think mdb should be able to pick up kernel ctf info so that it can print data structures on disk. That I could probably get working on illumos fairly easily.
My method used zdb to get the data uncompressed, then used mdb to print it out a la ::print. I actually think something like "offset::zprint [decompression] type" in mdb is the way to go. It would mean no need for zdb, which usually gives too much or not enough, and is not interactive (hence, not really a good debugger as far as I'm concerned). Better would be:
# mdb -z poolname
20000::walk uberblock | ::print -t uberblock_t
And from there, something like:
offset::zprint lzjb objset_phys_t
where offset comes from a DVA displayed in the uberblock_t.
Some people seem to get my idea and think it's good. Others either don't get it, or don't care.
Someone like Delphix might really like it.
Just my 2 cents.
Unlike UFS or FFS or EXTn the file system couldn't be corrupted by loss of power mid write, but like ZFS it can be corrupted by bugs in the code which write a corrupted version to disk. So the tool does something similar to fsck but it is simpler, more of a data structure check rather than a "recreate the flow of buffers through the buffer cache to save as much as possible" exercise.
Worked at a company that bought a ton of NetApp filers to support a webmail service. NetApp sales engineers swear on their mothers' graves that there is no such thing as fsck for wafl, that wafl always transitions from one gold-plated consistent state to the next with no possibility of metadata inconsistency. OK.
Three months later, big outage. On-site techs report the filers display "fast wack" on the front panel. Call NetApp support. What is "fast wack"? That's the fsck. Assholes!
It turned out that the filer had got corrupt somehow, and wack itself could not comprehend a filesystem with more than 2 billion files. Inode number stored in signed int32. Major, major surgery, hotpatching of filer firmware, three days of downtime, serious negative press coverage.
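The signed int32 ceiling is easy to reproduce. This is only a Python illustration of the general failure mode, not NetApp's code; `as_int32` is a helper name invented here, and `ctypes.c_int32` stands in for a C `int32_t` field.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647: the "2 billion" ceiling


def as_int32(n):
    """Return n as a C signed 32-bit integer would store it."""
    return ctypes.c_int32(n).value


last_ok = as_int32(INT32_MAX)        # highest inode number that fits
wrapped = as_int32(INT32_MAX + 1)    # one more file wraps to a negative number
```

Any checker that uses such a field as an array index or a loop bound has no sane way to handle the negative value.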
Bottom line: whenever anyone tells you their filesystem is guaranteed to be consistent, kick that person right in the shins.
How does it manage to stay consistent if a cosmic ray strikes it and flips one or more bits?
How does it manage to stay consistent if you physically bump in to the drives and cause physical damage by having the disk head briefly touch the disk surface?
Wouldn't you need a filesystem consistency check and repair tool like fsck in these cases?
At the time (and I think it's still true) cosmic rays do not have sufficient energy to flip a magnetic domain on disk. Memory bit flips are detected by ECC, and channel errors (between the I/O card and memory and/or disk) are identified with CRC codes.
"How does it manage to stay consistent if you physically bump in to the drives and cause physical damage by having the disk head briefly touch the disk surface?"
The disks are part of a RAID4 or 6 group (RAID 6 preferred for drives > 500GB, required for drives >= 2TB) so physically damaging a drive results in a group reconstruction of the data on that drive.
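For readers who have not seen how that reconstruction works, here is a minimal Python sketch of single-parity (RAID4-style) rebuild using XOR. It assumes equal-sized blocks and one dedicated parity disk; `xor_blocks` and the sample data are made up for illustration and say nothing about NetApp's actual implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data_disks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_disks)          # stored on the dedicated parity disk

# Disk 1 is physically destroyed; rebuild it from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
```

With a single parity block this only works when exactly one disk in the group is lost, which is why a second parity (RAID 6) matters for large drives with long rebuild times.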
NetApp has always had a pretty solid "don't trust anything" sort of mantra that has been tested and fortified a few times by various events. The ones I got to see first hand were an HBA that corrupted traffic through it in flight, drives that returned a different block than you asked for, and drives that acknowledged they had written data to the drive when in fact they had not.
Back in the early 2000s, anything that could happen to a disk with a probability of one in a billion operations or higher, they got to see once a month. It was an interesting challenge which requires a certain discipline to deal with. When I went to Google and saw their "we assume everything is crap, we just fix it in software" model it gave me another perspective on how to tackle the problem of storage reliability.
Both schemes work and have their plusses and minuses.
2. When there is a bug in the on-disk-state it should be addressed by the code that reads the data, not by an fsck tool.
2.1. The correction of the bug in the on-disk-state should be done on the basis of the exact knowledge about the bug and not by a generic check tool.
3. Repair is always based on assumptions. Those could be correct or incorrect. The more you know about the problem that led to the repair-worthy state, the more probable the assumptions are correct.
4. What is the reasoning behind the argument that "when your metadata is corrupt, the data is still correct", and so you could repair metadata corruption without problems? It sounds more sensible to fall back to the last known correct and consistent state of metadata and data, based on the on-disk-state represented by the pointer structure of the uberblock with the highest transaction group commit number with a correct checksum. The transaction group rollback at mount does exactly this.
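The rollback rule in point 4 can be sketched in a few lines of Python. This is a simplified model, not the real uberblock format: the `Uberblock` fields, `checksum_of`, `make_uberblock` and `pick_active` are hypothetical names, and SHA-256 is used only as a stand-in checksum.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Uberblock:
    txg: int              # transaction group commit number
    root_pointer: bytes   # stand-in for the pointer to the rest of the tree
    checksum: bytes


def checksum_of(txg, root_pointer):
    return hashlib.sha256(txg.to_bytes(8, "little") + root_pointer).digest()


def make_uberblock(txg, root_pointer):
    return Uberblock(txg, root_pointer, checksum_of(txg, root_pointer))


def pick_active(uberblocks):
    """Return the valid uberblock with the highest txg, or None if none is valid."""
    valid = [ub for ub in uberblocks
             if ub.checksum == checksum_of(ub.txg, ub.root_pointer)]
    return max(valid, key=lambda ub: ub.txg) if valid else None


ring = [make_uberblock(txg, f"tree@{txg}".encode()) for txg in (100, 101, 102)]
ring[2].root_pointer = b"garbage"   # simulate a torn write of the newest one
active = pick_active(ring)          # falls back to txg 101
```

Nothing is repaired here: the newest state that fails verification is simply ignored and the previous consistent one is used.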
It was a development machine, so it wasn't being backed up. I thought it was just one disk going bad; by the time it was clear that it was something worse than that, it was too late. Most of the important contents of the pool had been checked into the VCS, but not everything. I wound up grepping the raw disk devices to find the latest versions of a couple of files.
Any filesystem would have had serious trouble in such a situation, of course. But I can't help thinking that picking up the pieces might have been easier with, say, EXT3.
On the other hand, I think it speaks well for ZFS that a slowly failing PSU seems to be almost the only way to lose a pool.
zpool clear -F data
It appears that ZFS lacks a full consistency checker -- scrub only walks the tree and computes checksums; notably absent in this procedure appears to be validating the DDT. While ZFS claims to be always on-disk consistent--and I certainly believe that the intent is that it be so!--I seem to have tripped over some bug ( http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016... ) which corrupted the DDT, and now I have no way of rebuilding it, so I dropped $$$ (for me) on a new disk array and zfs send | zfs recv so that everything rebuilt. That's sort of crazy, if I may be so bold.
I suppose I could take the pool offline for several days and poke at it with zdb, but that is not really desirable either.
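To make the "walks the tree and computes checksums" limitation concrete, here is a toy Python scrub. It is a sketch under simplifying assumptions, not how ZFS lays out block pointers: each parent just records a SHA-256 of its child's payload, and `digest`, `scrub` and `build` are names invented for the example.

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).hexdigest()


def scrub(store, address, expected, errors=None):
    """Walk the tree from address, collecting blocks whose checksum mismatches."""
    errors = [] if errors is None else errors
    payload, children = store[address]
    if digest(payload) != expected:
        errors.append(address)
    for child_address, child_checksum in children:
        scrub(store, child_address, child_checksum, errors)
    return errors


def build(leaf_payload, recorded_payload):
    """One-leaf tree; the parent records the checksum of recorded_payload."""
    return {
        "root": (b"directory", [("leaf", digest(recorded_payload))]),
        "leaf": (leaf_payload, []),
    }


root_sum = digest(b"directory")
clean = scrub(build(b"good", b"good"), "root", root_sum)
bit_rot = scrub(build(b"gooe", b"good"), "root", root_sum)
# A buggy writer stores nonsense together with a matching checksum.
buggy_write = scrub(build(b"\x00" * 4, b"\x00" * 4), "root", root_sum)
```

The walk catches the flipped byte, but the block of zeros passes because its checksum is valid. A checksum walk only proves that the disk returned what was written, not that what was written made sense.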
If you are happy that the ZFS code is perfect, then it makes sense to rely upon its consistency checks, snapshot features, etc (and I'm not criticising those). But what if ZFS isn't 100%? How do you recover your data?
That problem is solved by having backups. ZFS is not a replacement for backups.
Oh, and there's always zdb, the ZFS debugger, you can use it to walk the on-disk structures.
Also, is "you can't rely on fsck to alleviate that problem" an argument for not having an fsck?
Is there a particular type of data corruption that fsck would recover that ZFS would not?
If your OS or disks are buggy, you are hosed, anyway, as that checking tool would run on the same OS and hardware.
You should look at ZFS as having a built-in fsck that is automatically invoked when needed.
I'm not too familiar with exactly how fsck works, but it seems to mainly stitch bad metadata back together so there's still no guarantee that your data is perfectly restored.
This would be especially true if a bad FS write operation caused the data to be corrupt.
If the data is really corrupt the FS doesn't have any inherent way of knowing what the correct data should be.
I'm also rather confused by Oracle contributing to btrfs while also building ZFS privately. My intuition is that if they open-sourced ZFS and offered it under a dual BSD/GPL license, it would become the fs standard overnight.
Anyway: You do not repair the last state of the data. And in my opinion: You should not try to repair it ... at least not by automatic means. Such a repair would be risky in any case. [..] In this situation I would just take my money from the table and call it a day. You may lose the last few changes, but your tapes are older.
This "you do not need an emergency repair tool because in an emergency I think you should just forget it" is exactly the claim that this blog post was supposed to be countering. It was supposed to explain why a do-the-best-you-can repair utility is not necessary, and the argument it boils down to is "because I don't think you should do that".
And as I already wrote in a different comment: if there is a bug in the code writing the on-disk state, it should be addressed in the code reading the on-disk state, based on exact knowledge of the bug. That means not a tool that makes assumptions about what could be halfway correct, but a piece of code that does the correct thing with the incorrect on-disk state.
But the argument went:
Detractor: ZFS needs fsck.
You: No it doesn't.
Detractor: The ZFS creators' attitude has always been "we don't think it should exist", but there's no more reason than this. It still corrupts so it still needs an fsck tool.
You: Here is a big blog post about why it doesn't: OK so it can get corrupted but I don't think an fsck tool should exist.
You know how useful it is to post on StackExchange "help I have this situation, I know conceptually there is a way out, but how can I actually do it?" and get the replies "you shouldn't want to do that"? It's not helpful at all.
I think the situation is pretty much similar to the "shoot the messenger" problem of ZFS. Some people are annoyed that ZFS reports errors because of corruption and blocks access to the data (of course without having any redundancy). However, the alternative would be reading incorrect data. What's worse: knowing that you have to recover data, or processing incorrect data without knowing it?
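The trade-off can be sketched in Python. This is an illustrative model only: `verified_read` and `ChecksumError` are hypothetical names, SHA-256 is a stand-in checksum, and a list of byte strings plays the role of the copies on each side of a mirror.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when no available copy of a block matches its checksum."""


def verified_read(copies, expected):
    """Return a copy matching expected, overwriting bad copies with the good one."""
    good = next((c for c in copies if hashlib.sha256(c).digest() == expected), None)
    if good is None:
        raise ChecksumError("no copy matches its checksum")
    for i in range(len(copies)):
        copies[i] = good   # repair the damaged side of the mirror
    return good


expected = hashlib.sha256(b"payload").digest()

# With a mirror, the bad copy is detected and repaired from the good one.
mirror = [b"corrupt", b"payload"]
healed = verified_read(mirror, expected)

# Without redundancy, the read fails loudly instead of returning bad data.
try:
    verified_read([b"corrupt"], expected)
    single_disk_failed = False
except ChecksumError:
    single_disk_failed = True
```

The single-disk case is the "messenger" being shot: the error is the filesystem declining to hand back data it knows is wrong.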
Performance is decent enough with lz4 compression and dedup off. Dedup on takes more CPU but nothing even the 2.2GHz Turion can't handle. Main thing is stability has improved a lot too.
If you want the utmost performance, maybe this isn't for you, but for NAS/backup/streaming type usage ZFS on Linux is nearly perfect.
ZFS doesn't need fsck because it has virtual log-structured metadata and can therefore always recover itself.
That kind of threw me for a loop.