why would you want to embed raid5/6 in the filesystem layer? Linux has battle-tested mdraid for this, I'm not going to trust a new filesystem's own implementation over it.

Same for encryption: there are already existing crypto layers at both the block level and the filesystem level (as an overlay).



Because the FS can be deeply integrated with the RAID implementation. With a normal RAID, if the data at some address is different between the two disks, there's no way for the fs to tell which is correct, because the RAID code essentially just picks one, it can't even see the other. With ZFS, for example, there is a checksum stored with the data, so when you read, zfs will check the data on both and pick the correct one. It will also overwrite the incorrect version with the correct one, and log the error. It's the same kind of story with encryption: if it's built in, you can do things like incremental backups of an encrypted drive without ever decrypting it on the target.
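
A rough sketch of that self-healing idea, for illustration only (this is not how ZFS is actually implemented; the 4K block size, SHA-256 checksum, and file-like "mirror" handles are all assumptions):

    import hashlib

    BLOCK_SIZE = 4096  # assumed block size for this sketch

    def read_with_repair(mirrors, block_no, expected_checksum):
        """Return a good copy of a mirrored block, repairing any leg that disagrees.

        `mirrors` are file-like objects opened read/write on each mirror leg;
        `expected_checksum` is what the FS recorded alongside the block pointer.
        """
        copies = []
        for dev in mirrors:
            dev.seek(block_no * BLOCK_SIZE)
            copies.append(dev.read(BLOCK_SIZE))

        good = next((c for c in copies
                     if hashlib.sha256(c).hexdigest() == expected_checksum), None)
        if good is None:
            raise IOError("all copies fail the checksum; block is unrecoverable")

        # Self-heal: overwrite any leg whose copy did not match.
        for dev, data in zip(mirrors, copies):
            if data != good:
                dev.seek(block_no * BLOCK_SIZE)
                dev.write(good)
        return good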


> when you read, zfs will check the data on both and pick the correct one.

Are you sure about that? Always reading both doubles read I/O, and benchmarks show no such effect.

> there's no way for the fs to tell which is correct

This is not an immutable fact that precludes keeping the RAID implementation separate. If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID). I've seen it detect and fix bad blocks, many times. It works, even with separate layers.
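
A sketch of what that layered approach can look like, with the filesystem owning the checksums and the redundancy layer merely exposing per-copy reads and writes. The read_copy/write_copy calls are hypothetical stand-ins for whatever ioctl (or equivalent) the lower layer would provide; they are not an existing Linux API:

    import hashlib

    def scrub_block(redundant_dev, lba, expected_checksum, num_copies):
        """Find a copy matching the FS checksum; ask the lower layer to fix the rest."""
        good, bad = None, []
        for i in range(num_copies):
            data = redundant_dev.read_copy(lba, copy=i)       # hypothetical per-copy read
            if hashlib.sha256(data).hexdigest() == expected_checksum:
                good = data
            else:
                bad.append(i)
        if good is None:
            raise IOError("no copy matches the filesystem's checksum")
        for i in bad:
            redundant_dev.write_copy(lba, copy=i, data=good)  # hypothetical targeted rewrite
        return good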

This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH. Understanding other complex subsystems is hard, and it's more fun to write new code instead.


> If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID).

This is all great, and I assume it works great. But it is in no way generalizable to all the filesystems Linux has to support (at least at the moment). I could only see this working in a few specific instances with a particular set of FS setups. Even more complicating is the fact that most RAIDs are hardware-based, so just using ioctls to pull individual blocks wouldn't work for many (all?) drivers. Convincing everyone to switch over to software RAID would take a lot of effort.

There is a legitimate need for these types of tools in the sub-PB, non-clustered storage arena. If you're working on a sufficiently large storage system, these tools and techniques are probably par for the course. That said, I have definitely lost 100 GB of data to bit rot on a multi-PB storage system in a Top500 HPC system. (One bad byte in a compressed data file left everything after that byte unrecoverable.) This would not have happened on ZFS.
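
A toy reconstruction of that failure mode (not the actual file or incident), showing how a single flipped bit in a compressed stream takes out everything behind it:

    import zlib

    original = b"sensor reading 42\n" * 100_000      # compressible test data
    damaged = bytearray(zlib.compress(original))
    damaged[len(damaged) // 2] ^= 0x01               # one bit of "rot" mid-stream

    try:
        zlib.decompress(bytes(damaged))
    except zlib.error as e:
        # Decompression fails; without a checksummed, redundant lower layer,
        # the data after the damaged byte is simply gone.
        print("decompression failed:", e)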

ZFS was/is a good effort to bring this functionality lower down the storage hierarchy. And it worked because it had knowledge of all of the storage layers. Checksumming files/chunks helps most if you know about the filesystem and which files are still present. And it only makes a difference if you can access the lower-level storage devices to identify and fix problems.


> it is in no way generalizable to all the filesystems Linux has to support

Why not? If it's a standard LVM API then it's far more general than sucking everything into one filesystem like ZFS did. Much of this block-mapping interface already exists, though I'm not sure whether it covers this specific use case.


> This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH.

At the time that ZFS was written (early 2000s) and released to the public (2006), this was not a thing and the idea was somewhat novel / 'controversial'. Jeff Bonwick, ZFS co-creator, lays out their thinking:

* https://blogs.oracle.com/bonwick/rampant-layering-violation

Remember: this was a time when Veritas Volume Manager (VxVM) and other software still ruled the enterprise world.

* https://en.wikipedia.org/wiki/Veritas_Storage_Foundation


I debated some of this with Bonwick (and Cantrill, who really had no business being involved but he's pernicious that way) at the time. That blog post is, frankly, a bit misleading. The storage "stack" isn't really a stack; it's a DAG. Multiple kinds of devices, multiple filesystems plus raw block users (yes, they still exist and sometimes even have reason to), multiple kinds of functionality in between. An LVM API allows some of this to have M users above and N providers below, for M+N total connections instead of M*N (e.g. five filesystems over four providers means nine interfaces instead of twenty). To borrow Bonwick's own condescending turn of phrase, that's math. The "telescoping" he mentions works fine when your storage stack really is a stack, which might have made sense in a not-so-open Sun context, but in the broader world where multiple options are available at every level it's still bad engineering.


> ... but in the broader world where multiple options are available at every level it's still bad engineering.

When Sun added ZFS to Solaris, they did not get rid of UFS and/or SVM, nor prevent Veritas from being installed. When FreeBSD added ZFS, they did not get rid of UFS or GEOM either.

If an admin wanted or wants (or needs) to use the 'old' way of doing things, they can.


Sorry, I'm pernicious in what way, exactly?


Heh. I was wondering if you were following (perhaps participating in) this thread. "Pernicious" was perhaps a meaner word than I meant. How about "ubiquitous"?


The fact that RAID, LVM, etc. are traditionally not part of the filesystem is just an accident of history. It's just that no one wanted to rewrite their single-disk filesystems once they needed to support multiple disks. And the fact that administering storage is so uniquely hard is a direct result of that.


However it happened, modularity is still a good thing. It allows multiple filesystems (and other things that aren't quite filesystems) to take advantage of the same functionality, even concurrently, instead of each reinventing a slightly different and likely inferior wheel. It should not be abandoned lightly. Is "modularity bad" really the hill you want to defend?


> However it happened, modularity is still a good thing.

It may be a good thing, and it may not. Linux has a bajillion file systems, some more useful than others, and that is unique in some ways.

Solaris and other enterprise-y Unixes at the time only had one. Even the BSDs generally run on only a few, rather than ext2/3/4, XFS, ReiserFS (remember when that was going to take over?), btrfs, bcachefs, etc., etc., etc.

At most, a company may have purchased a license for Veritas:

* https://en.wikipedia.org/wiki/Veritas_Storage_Foundation

By rolling everything together, you get ACID writes, atomic space-efficient low-overhead snapshots, storage pools, etc. All this just by removing one layer of indirection and doing some telescoping:

* https://blogs.oracle.com/bonwick/rampant-layering-violation

It's not "modularity bad", but that to achieve the same result someone would have had to write/expand a layer-to-layer API to achieve the same results, and no one did. Also, as a first-order estimate of complexity: how many lines of code (LoC) are there in mdraid/LVM/ext4 versus ZFS (or UFS+SVM on Solaris).


Other than esoteric high-performance use cases, I'm not really sure why you would need a plethora of filesystems. And the list of them that can actually be trusted is very short.


I'd like to agree, but I don't think the exceptions are all that esoteric. Like most people I'd consider XFS to be the default choice on Linux. It's a solid choice all around, and also has some features like project quota and realtime that others don't. OTOH, even in this thread there's plenty of sentiment around btrfs and bcachefs because of their own unique features (e.g. snapshots). Log-structured filesystems still have a lot of promise to do better on NVM, though that promise has been achingly slow to materialize. Most importantly, having generic functionality implemented in a generic subsystem instead of in a specific filesystem allows multiple approaches to be developed and compared on a level playing field, which is better for innovation overall. Glomming everything together stifles innovation on any specific piece, as network/peripheral-bus vendors discovered to their chagrin long ago.


>I work on one of the four or five largest storage systems in the world

What would you recommend over zfs for small-scale storage servers? XFS with mdraid?

I'd also love to hear your opinion on the Reiser5 paper.


> With a normal RAID, if the data at some address is different between the two disks, there's no way for the fs to tell which is correct, because the RAID code essentially just picks one, it can't even see the other.

That's a problem only with RAID1, only when copies=2 (granted, the most common case), and only when the underlying device cannot report which sector has gone bad.
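
For illustration, a toy sketch of why copies>2 changes the picture even without checksums: with three or more copies you can at least majority-vote, whereas two disagreeing copies are ambiguous (this assumes whole-copy comparison, which a real array would do per sector):

    from collections import Counter

    def pick_copy(copies):
        """Majority-vote among redundant copies of a block."""
        counts = Counter(copies)
        (best, votes), = counts.most_common(1)
        if len(copies) == 2 and len(counts) == 2:
            raise IOError("two copies disagree; no way to tell which is right")
        if votes <= len(copies) // 2:
            raise IOError("no majority among copies")
        return best

    print(pick_copy([b"good", b"good", b"bad!"]))   # -> b'good'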


> why would you want to embed raid5/6 in the filesystem layer?

There are valid reasons, most having to do with filesystem usage and optimization. Off the top of my head:

- more efficient re-syncs after failure (don't need to re-sync every block, only the blocks that were in use on the failed disk)

- can reconstruct data not only on disk self-reporting, but also on filesystem metadata errors (CRC errors, inconsistent dentries)

- different RAID profiles for different parts of the filesystem (think: parity raid for large files, raid10 for database files, no raid for tmp, N raid1 copies for filesystem metadata)

and for filesystem encryption:

- block-level encryption (e.g. CBC/XTS over fixed-size sectors) has a common weakness: the block size and offsets are constant. If you use FS-object encryption instead of whole-FS encryption, the block size, offset and even the encryption keys can be varied across the disk (a sketch of the per-object-key idea follows this list)
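
A minimal sketch of the per-object-key point from the last bullet: derive a distinct key for each file/object from one master key plus a per-object nonce, so objects never share key material. This only illustrates key separation; the derivation scheme and field sizes are assumptions, not a reviewed encryption design:

    import hmac, hashlib, os

    def per_object_key(master_key: bytes, object_id: int, nonce: bytes) -> bytes:
        """Derive a per-file/object encryption key from a single master key."""
        info = object_id.to_bytes(8, "big") + nonce
        return hmac.new(master_key, info, hashlib.sha256).digest()

    master = os.urandom(32)
    k1 = per_object_key(master, object_id=1, nonce=os.urandom(16))
    k2 = per_object_key(master, object_id=2, nonce=os.urandom(16))
    assert k1 != k2   # each object gets its own key, unlike one key for the whole block device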


I think that to even call volume management a "layer", as though traditional storage was designed from first principles, is a mistake.

Volume management is just a hack. We had all of these single-disk filesystems, but single disks were too small. So volume management was invented to present the illusion (in other words, the lie) that they were still on single disks.

If you replace "disk" with "DIMM", it's immediately obvious that volume management is ridiculous. When you add a DIMM to a machine, it just works. There's no volume management for DIMMs.


Indeed there is no volume management for RAM. You have to reboot to rebuild the memory layout! RAM is higher in the caching hierarchy and can be rebuilt at smaller cost. You can't resize RAM while keeping data because nobody bothered to introduce volume management for RAM.

Storage is at the bottom of the caching hierarchy where people get inventive to avoid rebuilding. Rebuilding would be really costly there. Hence we use volume management to spare us the cost of rebuilding.

RAM also tends to have uniform performance. Which is not true for disk storage. So while you don't usually want to control data placement in RAM, you very much want to control what data goes on what disk. So the analogy confuses concepts rather than illuminating commonalities.


One of my old co-workers said that one of the most impressive things he's seen in his career was a traveling IBM tech demo in the back of a semi truck, where they would physically remove memory, CPUs, and disks from the machine without impacting the live computation being executed (apart from making it slower), and then add those resources back to the machine and watch them get recognized and utilized again.


> why would you want to embed raid5/6 in the filesystem layer?

One of the creators of ZFS, Jeff Bonwick, explained it in 2007:

> While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.

* https://blogs.oracle.com/bonwick/rampant-layering-violation


It's not about ZFS. It's about CoW filesystems in general; since they offer functionalities beyond the FS layer, they are both filesystems and logical volume managers.


Why does ZFS do RAIDZ in the filesystem layer?


It doesn't.

RAIDZ is part of the VDEV (Virtual Device) layer. Layered on top of this is the ZIO (ZFS I/O layer). Together, these form the SPA (Storage Pool Allocator).

On top of this layer we have the ARC, L2ARC and ZIL. (Adaptive Replacement Caches and ZFS Intent Log).

Then on top of this layer we have the DMU (Data Management Unit), and then on top of that we have the DSL (Dataset and Snapshot Layer). Together, the SPA and DSL layers implement the Meta-Object Set layer, which in turn provides the Object Set layer. These implement the primitives for building a filesystem and the various file types it can store (directories, files, symlinks, devices etc.) along with the ZPL and ZAP layers (ZFS POSIX Layer and ZFS Attribute Processor), which hook into the VFS.

ZFS isn't just a filesystem. It contains as many levels of layering as, if not more than, any RAID and volume management setup composed of separate parts like mdraid+LVM or similar, but they are much better integrated with each other.

It can also store stuff that isn't a filesystem. ZVOLs are fixed-size storage presented as block devices. You could potentially write additional storage facilities yourself as extensions, e.g. an object storage layer.
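
Reading the description above as a stack, top-down, it looks roughly like this (simplified; the Meta-Object Set / Object Set split is elided):

    VFS
      ZPL / ZAP          (ZFS POSIX Layer, ZFS Attribute Processor)
      DSL                (datasets and snapshots)
      DMU                (Data Management Unit: objects and transactions)
      ARC / L2ARC / ZIL  (caches and intent log)
      SPA = ZIO + VDEV   (I/O pipeline over virtual devices: mirrors, RAID-Z, ...)
      physical disks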



