
Why ZFS is not good at growing and reshaping pools, or shrinking them - zdw
https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSWhyNoRealReshaping
======
tobias3
As far as I know, it is a fundamental design difference between btrfs and ZFS.
Btrfs has extent back references. Given a location on disk, btrfs has an
efficient data structure that gets from the disk location to the items using
that disk location. E.g. if a sector on disk gets corrupted, it has a command
that lets you list all the files using that sector. With ZFS you'd have to
iterate through all files, list block locations, and check if one of them is
affected. Of course, such an additional index comes with a cost: every write
or delete operation has to maintain it, and it uses more metadata space. Off
the top of my head, this is used for rebalance, quota groups, device removal,
send/receive (finding reflinks), and filesystem shrinking.
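
Roughly, the difference looks like this (a toy sketch in Python with made-up
structures, not the actual on-disk formats of either filesystem):

```
# Toy sketch, not the real on-disk structures.

# Forward-only mapping (ZFS-like): each file knows where its blocks live.
files = {
    "/a": [100, 101, 250],
    "/b": [300, 101],          # block 101 is shared
}

def owners_without_backrefs(block):
    # Answering "who uses this disk location?" means walking every
    # block pointer of every file.
    return [path for path, blocks in files.items() if block in blocks]

# Extent back-reference index (btrfs-like): an extra table, maintained on
# every write and delete, answers the same question with one lookup.
backrefs = {}
for path, blocks in files.items():
    for b in blocks:
        backrefs.setdefault(b, []).append(path)

def owners_with_backrefs(block):
    return backrefs.get(block, [])

print(owners_without_backrefs(101))   # ['/a', '/b'] after a full scan
print(owners_with_backrefs(101))      # ['/a', '/b'] straight from the index
```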

For a similar trade-off, compare hard links on Windows vs. ext4 and co. on
Linux. On Linux there is no efficient way to list all hard links to a file,
while Windows (NTFS) has one (FindFirstFileName etc.). This is why creating
hard links on Windows is much slower and why there is a limit of 1024 hard
links to a file. On Linux, finding all hard links to a file means iterating
over every file and checking whether it has the same inode number.
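
On the Linux side, the brute-force scan looks something like this sketch
(GNU find's `-samefile` / `-inum` do essentially the same full walk):

```
import os
import sys

def find_hard_links(target, root="/"):
    """Walk the whole tree and collect paths sharing target's (device, inode)."""
    st = os.lstat(target)
    key = (st.st_dev, st.st_ino)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                s = os.lstat(path)
            except OSError:
                continue
            if (s.st_dev, s.st_ino) == key:
                matches.append(path)
    return matches

if __name__ == "__main__":
    # Usage: python3 findlinks.py /some/file [/scan/root]
    print("\n".join(find_hard_links(sys.argv[1], *sys.argv[2:3])))
```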

~~~
paulmd
> With ZFS you'd have to iterate through all files, list block locations and
> check if one of them is affected.

I mean... you just described a scrub. I'd gladly sit through a scrub if it
meant that I could grow a raidz vdev.

I suspect there's some other hurdle.

~~~
rainbowzootsuit
RAID-Z expansion is currently under development in OpenZFS.

[https://m.youtube.com/watch?v=ZF8V7Tc9G28](https://m.youtube.com/watch?v=ZF8V7Tc9G28)

[https://youtu.be/Njt82e_3qVo](https://youtu.be/Njt82e_3qVo)

~~~
justinclift
Also, the work-in-progress (alpha quality) preview code:

[https://github.com/zfsonlinux/zfs/pull/8853](https://github.com/zfsonlinux/zfs/pull/8853)

------
sliken
One thing I like about btrfs is offline deduplication. Whenever you want, you
can search for duplicates and then tell the filesystem about what you found.
This avoids many of the performance and RAM impacts of ZFS deduplication.
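
The scanning half of that is easy to sketch; real tools such as duperemove
then hand the matching ranges to the kernel's dedup ioctl (FIDEDUPERANGE) so
the filesystem can verify and share the extents. The sketch below only does
the finding, with a hypothetical scan root and block size:

```
import hashlib
import os
from collections import defaultdict

BLOCK = 128 * 1024   # hypothetical scan granularity

def block_hashes(path):
    """Yield (offset, sha256) for each fixed-size block of a file."""
    with open(path, "rb") as f:
        offset = 0
        while chunk := f.read(BLOCK):
            yield offset, hashlib.sha256(chunk).hexdigest()
            offset += len(chunk)

def find_duplicate_blocks(root):
    """Group (path, offset) locations by hash; groups > 1 are dedup candidates."""
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                for offset, digest in block_hashes(path):
                    groups[digest].append((path, offset))
            except OSError:
                continue
    return {h: locs for h, locs in groups.items() if len(locs) > 1}

# Example with a hypothetical directory; a real tool would now submit these
# ranges to the filesystem for byte-for-byte verification and sharing.
for digest, locations in find_duplicate_blocks("/data").items():
    print(digest[:12], locations)
```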

~~~
boomboomsubban
Are you really saving that much space doing that on a block level rather than
a file level?

~~~
Dylan16807
I'm not sure what you're comparing to. Both filesystems do deduplication at a
block level. But Btrfs can do reflink copies and ZFS can't. And Btrfs can do
after-the-fact deduplication and ZFS can't.

~~~
garmaine
He's saying scan for files that are the same and hard-link them. Both
filesystems can do this after-the-fact.

Of course this only works if the data is read-only.

~~~
Dylan16807
> Both filesystems can do this after-the-fact.

It won't reclaim any space on ZFS if you're using snapshots.

------
bcwu
RAID-Z expansion is in the works...

[http://open-zfs.org/w/images/6/68/RAIDZ_Expansion_v2.pdf](http://open-zfs.org/w/images/6/68/RAIDZ_Expansion_v2.pdf)

[https://www.youtube.com/watch?v=Njt82e_3qVo](https://www.youtube.com/watch?v=Njt82e_3qVo)

~~~
louwrentius
That was announced about 2 years ago.

~~~
tecleandor
There's ongoing work on this PR:

[https://github.com/zfsonlinux/zfs/pull/8853](https://github.com/zfsonlinux/zfs/pull/8853)

------
hinkley
Even before ZFS I wanted there to exist a RAID array where you could just swap
out a disk to increase capacity.

I wanted it in part for me, but I also wanted it for all the people in my life
who were doing a terrible job of taking care of their important data.

“Don’t worry,” I wanted to tell them. “Get one of these, slap some drives in
it. When you need some more space, buy a new drive, take the smallest one out
and put the new one in.”

Some of the early PR for ZFS was unclear on whether that was possible, but I
was excited. This was by far the closest we’d gotten. Later it was a feature
we’d have soon. I waited to adopt. Apple was going to support it, here we
come. Apple was _not_ going to support it. Okay, NAS could still happen. Sun
craters and I lose track, but I check in every so often, and still nothing.

In the grand scheme of things, it is not the biggest tech disappointment I’ve
suffered. But it feels like it should be in the top ten, even if it’s #10.

~~~
tatersolid
Windows Storage Spaces and Drobo both have the “grow proportionally onto
bigger or differently sized disks” feature.

------
the8472
Isn't that rewriting history a bit? Originally the roadmap was that these
kinds of features would be added based on the mythical block pointer rewrite.
Eventually BPR, and thus all features requiring it, were put on ice.

~~~
secabeen
Yeah, I think Sun/Oracle recognized that there was a subset of their user base
that would greatly benefit from BPR and all the associated features that come
from it, but that it wasn't their core customers, so it wasn't included in the
initial implementation. (In my experience, most enterprise customers would
rather replace an entire array that was full than expand or reshape it.)

Had Sun remained the powerhouse it was, BPR would probably have eventually
gotten done, but with the smaller resources devoted to ZFS development now vs.
in the Sun days, it's been kicked way down the priority list.

~~~
toast0
Enterprise users are most often using enterprise servers with fancy built-in
disk enclosures, and they're likely to fill those up with disks. Expanding the
storage by replacing with higher-capacity disks makes sense; expanding by
using more disks isn't as likely. And changing the configuration to have fewer
disks is also less likely; you paid for all those slots, so they're going to
be filled.

~~~
tjoff
> _Expanding the storage by replacing with higher capacity disks makes sense_

That is an insanely time-consuming operation that also increases the risk of
data loss, as you have less redundancy during the expansion.

An enterprise ought to just migrate the entire pool/content to somewhere else
and redo it completely.

Expanding by replacing with higher-capacity drives just feels like a hack and
an afterthought that only hobbyists would actually make use of.

~~~
the8472
> as you have less redundancy during the expansion.

If the old device is not faulty, then the replace operation will keep both old
and new in the array until it is complete.

> An enterprise ought to just migrate the entire pool/content to somewhere
> else and redo it completely.

That's also an option with send/receive, but that involves downtime. Drive
replacements can be done on a live system. Server-grade hardware generally
supports hotplug.

~~~
hinkley
If you left one bay empty, that is.

Which you probably should.

~~~
the8472
You can run the replace to some form of external drive and then swap the
physical drive once the replace is done.

------
cmurf
The indirection that exists in Btrfs is the internal virtual address space and
block groups. Metadata and data extents are referenced by virtual addresses,
in bytes. That address is translated by the chunk (block group) tree into two
things: a physical device, and a location on that device. Conceptually it's
not altogether different from ext4 on LVM, and hence the old critique that
Btrfs is a layering violation. But these things are integrated in Btrfs, and
the chunk layer is what makes so many things flexible, as well as a PITA when
it comes to ENOSPC (mostly solved these days).
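
As a toy model (not the actual chunk tree format), the translation is just a
range map from the flat virtual address space to (device, physical offset):

```
# Toy model of the btrfs-style indirection, not the actual chunk tree format:
# extents live in one flat virtual byte space, and a chunk map translates
# virtual ranges to (device, physical offset).
GIB = 1 << 30
chunks = [
    # (virtual_start, length, device, physical_start)
    (0 * GIB, 1 * GIB, "sda", 1 << 20),
    (1 * GIB, 1 * GIB, "sdb", 1 << 20),
]

def resolve(virtual_addr):
    for vstart, length, dev, pstart in chunks:
        if vstart <= virtual_addr < vstart + length:
            return dev, pstart + (virtual_addr - vstart)
    raise ValueError("address not mapped")

# Balance / device removal relocate a chunk and rewrite only this map;
# the extent and file trees keep pointing at the same virtual addresses.
print(resolve(5 * (1 << 20)))   # ('sda', <offset on sda>)
```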

------
cryptonector
> One fundamental reason is that ZFS is philosophically and practically
> opposed to rewriting existing data on disk; ...

That's not quite the issue.

The issue is that PHYSICAL addresses and content HASHes are stored together,
intermingled in the block pointer, which means the PHYSICAL addresses get
hashed into the copy-on-write tree, which means no block can be relocated
(i.e., have its physical address(es) changed) without rewriting the path to
the block containing a pointer to it as well.

I.e., ZFS is not content-addressed storage. In CAS the hash of a block _is_
its pointer, and the physical addresses are stored separately (not
intermingled with hashed data) and _not_ hashed into the tree.

IF instead ZFS had used ONLY the hash value as the blkptr_t, and then had
stored the physical addresses at the end of metadata blocks and then NOT
hashed the physical addresses into the tree, THEN relocating would be a lot
easier. Not trivial, mind you, just easier. Not trivial because of snapshots
and dedup: you can find a singular pointer to any data block in the absence of
snapshots and dedup, and rewrite the physical address stored nearby, but you
can't easily find duplicate pointers, so to relocate blocks in CAS w/
snapshots and dedup you have to leave behind forwardings, or have a database
of relocations (which is costly).
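
A toy sketch of the contrast (simplified, invented structures; not the real
blkptr_t layout):

```
import hashlib

def H(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

child = b"some data block"

# ZFS-style (simplified): the parent's pointer records the child's physical
# address(es) alongside the child's checksum, and the parent's own checksum
# covers that whole pointer. Move the child and the pointer changes, so the
# parent (and every ancestor up to the root) must be rewritten copy-on-write.
zfs_ptr = {"dva": ("disk0", 4096), "cksum": H(child)}
parent_cksum = H(repr(zfs_ptr).encode())

# CAS-style (hypothetical): the pointer *is* the content hash, and physical
# locations live in a side table that is not covered by the tree's hashes.
cas_ptr = H(child)
locations = {cas_ptr: ("disk0", 4096)}

# Relocating the block in the CAS scheme only touches the side table;
# every tree pointer and checksum stays valid.
locations[cas_ptr] = ("disk1", 8192)
assert H(child) == cas_ptr
```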

> (In the grand tradition of computer science we can sort of solve this
> problem with a layer of indirection, where the top layer stays immutable but
> the bottom layer mutates. This is awkward and doesn't entirely satisfy
> either side, and is in fact how ZFS's relatively new pool shrinking works.)

Yes, and proper CAS also has a layer of indirection, and some awkwardness
around snapshots and dedup (see above). For mirror drive evacuation there's
nothing to do. For shrinking operations you only need to relocate blocks whose
physical addresses are beyond the end of the new device (but you'll need a
fair bit of swing storage for the interim relocation or to use as forwardings
until you've found and fixed all pointers to any one relocated block).

I've said this many times in the past, and someone always complains that not
hashing the physical addresses is unsafe, as if a hash of the contents is not
safe but a hash of contents and physical address could somehow be so much
safer (it can't be much safer, if at all). And again, see CAS.

------
microcolonel
bcachefs is looking promising to me. I hope that it doesn't unlearn any of the
lessons we've learned over the years from ZFS, which is itself excellent in
many ways.

------
anon9001
I might be misguided here (please correct me), but I feel like ZFS and
hardware RAID had their place pre-SSD, and now they don't make much sense.

If you want an array of disks with a mutable configuration, I think the right
choice is mdadm and LVM. They're reliable, relatively easy to use, let you
move disks between machines, don't rely on hardware RAID, and support all file
systems.

If you need something faster and more optimized like ZFS or btrfs, you're
probably better off spending on SSDs anyway.

Tell me why I'm wrong, HN :)

~~~
Wowfunhappy
I think most people use ZFS for data integrity, not performance.

I make backups regularly (manually, because I want them to be offline), but
I'd rather not have to use them if one of my drives dies. Unless I literally
just finished making the backup, I'm going to lose some amount of work.

My mirrored ZFS pool means that (A) a drive can die and I won't lose _any_
data, no matter how recent, and (B) I'm confident none of the data on either
drive is corrupted, because ZFS checks it regularly (every time it's accessed,
and during weekly scrubs).

~~~
stiray
Exactly. Anyway, my home server has 6 TB of redundant storage (3x Toshiba
3 TB) plus a 10 TB Hitachi drive, non-redundant. What would the cost of SSDs
be? Over $1000, not to mention the controller: a 4-port SATA controller vs. a
19+ port controller is a huge difference in price.

I do use SSDs where I need speed - as L2ARC ;) - a far better use than buying
19 TB of SSDs, not to mention the price.

Anyway, I do use ZFS on my laptop for the Linux root. Reason? Snapshots. KVM +
ZFS, perfect match. /.zfs/snapshot? Fabulous.

~~~
nickik
There is this fun backup tool that I have yet to set up.

[https://www.znapzend.org/](https://www.znapzend.org/)

