I often just search for videos by him and listen to him present a topic or talk about anything. I learned about things I normally wouldn't go and read about otherwise -- the history of Sun, ZFS, DTrace, containerization (Joyent, the Zones legacy, KVM, comparison with Docker).
In that respect he is like David Beazley (of Python fame). I could just listen to David talk all day and not get tired of it.
ZFS never went through a period where people lost data using it. ... amazing how people are willing to write that off like 'Oh well, it kind of lost my data but Omg! A million eyeballs'.
This is like a 12-year-old who is stealing cars -- it's a serious issue. It's not "Well... he's only 12", it's more like "He stole a car at gunpoint, having just robbed a liquor store". You can't say let's just wait till he's 18. Do you really want to find out what he's like when he's 18?
Well, in this particular case it seems this feature was added in Linux 2.6.2 (2004). epoll itself was added in Linux 2.5.44 (2002), so yep, you're correct that they didn't get this right immediately. Now, I think it's rather easy to play this game of pinpointing flaws, and it could be done the other way around, probably in an equally unconstructive manner (e.g.: how long did FreeBSD wait for robust mutexes?).
That said, amazing speaker, thanks for the links.
The reason epoll has a stupid mode is that it was also designed to be a drop-in replacement for select.
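For anyone who hasn't hit it: the "stupid mode" is presumably the level-triggered default, which behaves like select/poll; edge-triggered is opt-in via EPOLLET. A minimal Python sketch of the difference (Linux only; the address and the accept loop are made up for illustration):

    import select
    import socket

    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 8080))
    srv.listen()
    srv.setblocking(False)

    ep = select.epoll()

    # Level-triggered (the default): behaves like select/poll -- the fd keeps
    # showing up as readable on every poll() until it has been drained.
    ep.register(srv.fileno(), select.EPOLLIN)

    # Edge-triggered: only reports readiness *changes*, so the application
    # must accept()/read() in a loop until EAGAIN or it will miss pending work.
    # ep.register(srv.fileno(), select.EPOLLIN | select.EPOLLET)

    while True:
        for fd, events in ep.poll(timeout=1.0):
            if fd == srv.fileno():
                conn, addr = srv.accept()
                conn.close()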
It's so frustrating to watch this. To see all these projects being developed on Linux where it's easier for others to simply reimplement the idea from scratch than try and remove all the Linux only code from them.
The Linux world only thinks about itself. In a spectacularly selfish way.
In the case of Btrfs it's a laudable effort, but after an awfully long time in development it has yet to approach production readiness. Even if the licences are incompatible, the on-disk structures of ZFS could have been used to create a compatible implementation.
So I'm afraid I'm increasingly using other systems such as FreeBSD (with ZFS) and leaving Linux behind somewhat. Ultimately open systems, open standards and free software and software freedom are more important than a single implementation.
If Linux was a research experiment with exposure but little use compared to the other Unix systems, we would probably all look at it much more favorably, not because they are doing things right, but because they are willing to try crazy shit that we would all get to see the post-mortems for. The problem is that Linux is not some semi-obscure OS, so instead of these experiments being interesting reading for the weekend, we're often forced to become users of them, whether we like to or not.
I'd much rather this than the *BSD policies.
Oh, and speaking of FreeBSD, it's hardly bug-free.
Look at these Golang issues and tell me they don't indicate something wrong in the FreeBSD kernel's threads/fork/exec code.
I'd never use raid5/6 on anything. Hardware is cheap.
And how does this approach compare to other UNIXes? And I don't mean UNIXes from the 1970s, I mean ones which are currently in use: FreeBSD, OpenBSD, Solaris, MacOSX. Last I heard, they all had roughly equivalent init systems to systemd.
Frankly this is what should have happened with the likes of Debian regarding systemd as well. A parallel release offering systemd as the default, alongside sysv. Then it would be up to the admins to pick, rather than have to apply contortions during or right after install to switch from systemd to sysv.
Instead Debian adopted systemd, and we got Devuan as a sysv fork...
I've played with BTRFS (in RAID-1 mode) instead of ZFS because of the latter's limitations about adding drives. With BTRFS, you can drop in a new drive at any time and rebalance. This is how things ought to work. ZFS's concept of VDEVs might make sense for huge corporate storage arrays, but they're just a pain if you're running a small server or NAS, where growing the array drive-by-drive is a very common scenario.
Due to the complexity of dealing with VDEVs, I've personally never seen much use for ZFS on Linux vs. just biting the bullet and dealing with LVM over mdadm. BTRFS is the closest thing to a compelling argument to moving away from that setup I've seen on Linux.
EDIT: ... and neither is Ubuntu, since what they're doing is "mere aggregation" (see the GPL).
Honestly, this whole debacle could have been avoided if ZFS had been relicensed under some kind of compatible license like one of the BSD licenses.
Both the GPL and CDDL are distribution licenses; using a ZFS module with Linux isn't a problem, you aren't violating any of the licences by using them together on your computer system. There's plenty of precedent for non-GPL kernel modules. It's not a problem in practice.
EDIT: Well, now that I think about it... I don't agree that the skepticism should be based on licensing. I just think that ZFS-on-Linux is perhaps not quite mature enough yet, and we do have off-site backups anyway, so we might as well try it if we're not doing anything absolutely 24/365-critical.
Once this is a single option in the installer like it is for LVM, it will become properly usable. Right now, using it for the rootfs is only for masochists who want to experiment.
Regarding maturity, it's certainly not as well tested as on e.g. FreeBSD or Solaris, but it's been around for a good while at this point and I know a few people who have been running it for several years trouble free. I would certainly place it above Btrfs in the reliability stakes. The main limitations I've seen are cosmetic or Linux-specific; the zfs command is setuid root, which prevents delegating dataset admin to users (snapshot/send/recv your data, create your own subsidiary datasets etc.), Linux doesn't do NFSv4 ACLs in its VFS, and you can't transparently create NFS exports via zfs properties. None of these affect data integrity though. Not holding my breath waiting for NFSv4 ACLs, though it would certainly be a massively beneficial feature.
Or if Linux had been similarly relicensed. I think that's actually more reasonable, since it's the Linux license that may take issue with CDDL code, not the reverse.
Besides that, ZFS is covered by several patents, and the CDDL grants a patent license. If ZFS had been BSD-licensed, you'd have a much worse debacle.
ZFS, OTOH, is owned by a single entity, so relicensing it would be simple. It should be completely possible to make a new license that grants a patent license but is also GPL compatible.
I'll take their word for it in good faith.
Taking close to two decades to make xfs an equal citizen would be another example.
HAMMER fs can dedup with little RAM, but the ZFS devs aren't ignorant or uninformed, so I would expect them to have a plan to remedy the inefficiency of OpenZFS's dedup implementation. I know there are commercially available alternative dedup implementations, but I don't have any experience with those.
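For a sense of scale, here's a back-of-envelope calculation (my numbers, not from any official source -- the per-entry size is a commonly quoted ballpark and varies by platform and version):

    # Rough estimate of why ZFS dedup is RAM-hungry: the dedup table (DDT)
    # wants to live in memory. All figures below are assumptions.
    pool_size_tb = 10
    avg_block_kb = 64          # smaller average blocks -> many more entries
    ddt_entry_bytes = 320      # assumed in-core size of one DDT entry

    blocks = pool_size_tb * 1024**4 / (avg_block_kb * 1024)
    ddt_ram_gib = blocks * ddt_entry_bytes / 1024**3
    print(f"{blocks/1e6:.0f}M blocks -> ~{ddt_ram_gib:.0f} GiB of DDT")
    # ~168M blocks -> ~50 GiB of DDT for 10 TB of unique 64K blocks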
Two years ago, I migrated everything to ZFS and haven't looked back. It works. It hasn't lost any data, and it's easy to set up and administer. I can't give it any higher praise than that. It does its stuff with no fuss or drama.
The btrfs FAQ says that "unlike MD-RAID, btrfs knows what blocks are actually used by data/metadata, and can use that information in a rebuild/recovery situation", but is that really a good enough reason to reimplement the entire RAID subsystem from scratch? And couldn't TRIM/discard provide the same benefits?
For me, the clincher is that mdadm has a very stable, well-defined on-disk format, which is a huge bonus if you hit a bug or make a mistake. I once almost lost a personal RAID10 array by recklessly trying to add an extra drive with no backup, and without fully understanding what I was doing. I was able to recover all of my data by hacking together a Python script to reassemble all of the blocks in the correct order. I can't imagine how much effort that would have taken if I had to build something that understood the full details of btrfs's data structures.
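I don't have that script any more, but the idea is simple enough to sketch. Assuming a four-member md RAID10 in the default near=2 layout, with a chunk size and data offset that you'd normally read from the superblocks with mdadm --examine (all of the values below are made up), something like this recovers the first copy of each chunk:

    # Toy sketch, not the original script: reassemble an md RAID10 "near=2"
    # array by reading the first copy of each logical chunk from its member.
    CHUNK = 512 * 1024          # assumed chunk size in bytes
    DATA_OFFSET = 2048 * 512    # assumed data offset of the superblock, in bytes
    MEMBERS = ["sda1.img", "sdb1.img", "sdc1.img", "sdd1.img"]  # hypothetical
    NEAR = 2                    # two adjacent copies of each chunk

    def reassemble(out_path, total_chunks):
        devs = [open(p, "rb") for p in MEMBERS]
        with open(out_path, "wb") as out:
            for k in range(total_chunks):
                # In the "near" layout, copies of chunk k sit next to each other:
                # slot k*NEAR, wrapped across the members row by row.
                slot = k * NEAR
                member = slot % len(MEMBERS)   # first copy; its mirror is member+1
                row = slot // len(MEMBERS)
                devs[member].seek(DATA_OFFSET + row * CHUNK)
                out.write(devs[member].read(CHUNK))
        for d in devs:
            d.close()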
Moving storage pooling functionality into the filesystem is the right call for anything with server workloads; ZFS's self-healing functionality wouldn't work without it (and this is precisely why I use FreeNAS instead of Unraid or other alternatives -- I care about the long-term integrity of my data, especially my VMware images and photo collection).
#1: Do checksumming at the block level. Within each block, the RAID driver reserves a few bytes at the end for a checksum, and verifies them before returning data to the FS. (This would only work if the filesystem supports non-power-of-two block sizes.)
#2: Similar to #1, but pack the checksums for multiple data blocks into a separate dedicated checksum block. (This adds some extra read latency in the worst case, but caching could mitigate it; a rough sketch follows the list.)
#3: Let the filesystem handle checksumming, but extend the block device API to provide feedback to the kernel if the checksum is invalid. I think you only really need two API calls -- one which says "read this block as fast as possible and I'll verify it", and one which says "the block you returned looks bad, try to reconstruct it and give me all available candidates".
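To make #2 concrete, here's a rough sketch of the layout and the read path; the 4K block size, CRC32 and 64-blocks-per-group figures are arbitrary choices for illustration:

    import zlib

    BLOCK = 4096
    GROUP = 64   # data blocks covered by each dedicated checksum block (arbitrary)
    # On disk, every group of 64 data blocks is followed by one block holding
    # their CRC32s: [d0..d63][cksums][d64..d127][cksums]...

    def _layout(logical):
        """Map a logical data block to its physical block and checksum slot."""
        group, index = divmod(logical, GROUP)
        phys = group * (GROUP + 1) + index
        cksum_block = group * (GROUP + 1) + GROUP
        return phys, cksum_block, index

    def read_block(dev, logical):
        """dev is a block device or image file opened in binary mode."""
        phys, cksum_block, index = _layout(logical)
        dev.seek(phys * BLOCK)
        data = dev.read(BLOCK)
        dev.seek(cksum_block * BLOCK + index * 4)   # the extra read: the latency cost
        expected = int.from_bytes(dev.read(4), "little")
        if zlib.crc32(data) != expected:
            raise IOError(f"checksum mismatch on block {logical}")
        return data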
This doesn't handle phantom or misdirected writes. You can't really do that unless you keep the checksums elsewhere, and specifically in the place where you intend to reference the data.
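A toy illustration of why the checksum has to live with the reference rather than with the data (this is the idea behind ZFS-style block pointers, simplified here to a couple of dicts):

    import hashlib

    def checksum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    # Self-checksummed block: a "phantom write" that never reached the disk
    # leaves the OLD block behind, which still verifies against its OWN
    # embedded checksum -- the lost update is invisible.
    old_block = {"data": b"old contents"}
    old_block["cksum"] = checksum(old_block["data"])
    assert checksum(old_block["data"]) == old_block["cksum"]   # wrongly looks clean

    # Parent-held checksum: the referencing block records what the child
    # SHOULD contain. If the child's update was lost, the recorded checksum
    # no longer matches what's on disk, so the stale copy is caught on read.
    parent = {"child_cksum": checksum(b"new contents")}   # written with the update
    assert checksum(old_block["data"]) != parent["child_cksum"]   # mismatch detected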
> Let the filesystem handle checksumming, but extend the block device API to provide feedback to the kernel if the checksum is invalid.
I think by the time you get this API right, it's going to seem like just as bad a layering violation.
ZFS has a coherent set of layers built into it. They're just not the ones that existed before it. But I don't think those previous layers were designed with many of the important failure modes of real hardware devices in mind. That's understandable given when they were designed, but we've come a long way since then in the understanding of those failure modes and our expectations of system integrity.
At the filesystem level, you can specify different redundancy policies for different types of data. You can mirror metadata but not file contents, for example, which is a pretty good policy available in BTRFS that you can't achieve at the block level. You can also do sensible things like mirror across three drives by putting each file on two of the three drives. But there's a better solution.
The better solution is to handle this at a higher level. It's common to want data replicated across machines, at which point block-level redundancy really only sucks up disk space for no benefit. This is how cloud storage works, and it's how you set up an Exchange server these days, et cetera.
RAID and block-level redundancy will still be around (somebody will always need it), but it's a dying technology, at least in the sense that new deployments are using it less and less frequently.
But these layers also impede a number of important things needed for data integrity checking and efficient rebuilding on failure, as well as complicating administration when you need to rebuild bits of an array after failure or fiddle with logical volumes. From the point of view of a sysadmin, ZFS is a revelation. Datasets are simple; creating, resizing (quota), deleting are trivial and safe. Operations on the pool are logical and (relatively) safer than with mdraid or hardware RAID. Snapshots are safer--you are doing it at the filesystem level rather than the block device level, so it can never be inconsistent or get corrupted when your snapshot device is invalidated at some arbitrary future point when the block delta exceeds the device size.
The layering is still there. It's just subtly different. If you're used to LVM, then a "pool" is basically a volume group. The "vdev"s making up the pool are the physical volumes, which might be RAID sets. The "datasets" or "zvols" are logical volumes. It's a reinvention of what we already had, but it's more powerful, more flexible, and vastly easier to administer.
I was sceptical, but after learning about it (and there is a learning curve), doing some simple test installs (often repeated as I realised I'd not set up the pool with the right blocksize or partition alignment, or the optimal dataset structure) and then some actual deployments, I'm a convert. What it offers is currently unmatched.
As I learned the hard way, I was wrong and that warning is very much justified.
I've run Btrfs on many systems since just after it started to be usable, and written software with Btrfs-specific support which hammers it (and LVM) like nothing else, creating and destroying tens of thousands of transient snapshots. I've also now run ZFS on several systems, admittedly over a smaller timeframe (3 years vs 7-8ish).
I've had Btrfs totally trash a RAID1 mirror from a transient SATA cable connector glitch. On this test system, I had half the disk using Btrfs, half using mdraid/LVM. The mdraid half recovered and resynced transparently as soon as I reseated the connector; no service interruption or dataloss. Btrfs ceased to function, and on reboot toasted both mirrors, resulting in total unrecoverable dataloss and repeated kernel panics. That's been fixed for a while, but right here we're seeing the same thing. The failure codepaths, which are of critical importance, are untested and buggy. And even non-failure codepaths are still bad. Take the snapshotting case above: I had to take the system offline and do a full manual rebalance every 18 hours. The time from fresh new filesystem to read-only unbalanced disaster was just 18 hours when thrashed continuously, at most using 10% of the total space. And lastly, the performance of some things such as fsync is truly abysmal, to the extent that we had to use "eatmydata" to completely disable it for apt/dpkg operations! Under heavy parallel workloads, it could take many tens of minutes or hours(!) to complete writes which ext4 would complete in a minute or so.
I've yet to experience any problems at all with ZFS. Now that might have been luck on my part, but it might also be down to better design and quality of implementation. It's certainly been battle tested in high end installations. That's not to say that Btrfs doesn't have some neat features; it does a few things ZFS doesn't, like rebalancing data over its devices while ZFS only does that on write. But Btrfs has let me down badly every time I've used it in anger, and those few neat features don't make up for its lack of robustness--the primary purpose of the filesystem is to reliably store data, and it fails at that.
I don't like to see "mud slinging", since such fanboyism is unobjective and uninformed. I've reached my opinion based upon several years of practical, intensive use of Btrfs for various things, the most demanding of which was repeated whole-archive rebuilds of the whole of Debian when I was maintaining the Debian build tools, for which I wrote btrfs snapshot support specifically: over 30 parallel builds on a single system using independent snapshots per build, with over 20000 snapshots per run, creating and destroying several per second. The experiment was disastrous, and showed Btrfs to be unsuitable for such intensive workloads. When your filesystem is guaranteed to be turned read-only at some unpredictable and unknown point in the future, you can't rely on it. Regular rebalancing mitigates but doesn't solve this, and has a terrible performance impact. Not dataloss per se (unless it makes you lose writes when it turns read-only), but it's a serious design or implementation flaw. I did all this testing and adding of Btrfs support to various tools because I had high hopes for its potential; unfortunately it exposed serious shortcomings, many of which exist to this day. Today I'm using ZFS, not because of any irrational prejudice against Btrfs, but because Btrfs has never managed to deliver a robust and well-tested filesystem!
btrfs switched to readonly as fast as ext4 (which is probably the correct thing to do) but was useless for this problem.
On new hardware with new enterprise disks we chose btrfs, and we had a painful tour of crashes, data corruption, metadata corruption (undeletable files) and deadlocks until kernel 4.4, where things got a little bit better. The disks and server were enterprise class and fully working; these were just btrfs bugs. Also no RAID. I'm not doing that anymore, so I don't know if any new bugs have appeared, but the whole experience will keep me from ever using btrfs. This was ~2 years ago, and you could easily find lots of slides from Fujitsu or SUSE saying that btrfs is stable and you can use it (around kernel 3.13-3.16).
It's probably fine for your notebook or even your backup HDD but don't think you can stress it without experiencing pain (be it corruption, hangups or dataloss) or just abysmal performance.
That being said, ZFS on Linux is also a far cry from rock solid, but I'm optimistic that they'll shake out the problems and tackle them in a solid way.
As a Linux fanboy for years this gave me some solid appreciation for Solaris engineering.
If you've been following along at all in btrfs development, this doesn't really come as a huge surprise.
Anything is possible but the work required would be a big challenge.