ZFS copies=n is not a substitute for device redundancy (2016) (jrs-s.net)
77 points by harporoeder on Jan 17, 2021 | 39 comments



Can I add another gotcha: if you have a failed device in ZFS and you can't import the pool anymore, don't assume you can walk the drive and put the blocks back together manually.

You can do this in FAT and mostly in NTFS and UFS - especially if your drive is defragmented.

But the complicated data structures of ZFS make this really hard, if not close to impossible. There are tools and ways of using the ZFS code to help you do this, but they are exceedingly hard to use.

Source: it took me 52 hours to retrieve a 16k file from a failed device. Granted, it would take me less time now, but I now think of devices that have failed as if they have completely disappeared from our universe.


> I now think of devices that have failed as if they have completely disappeared from our universe.

I interpret this in the same way as “The network is compromised.”

It may not be literally true but you can avoid pain by always assuming it is true and acting accordingly.


10-ish years ago, I had a pool where my log device was irrecoverable. I'd just lose the last ~minute of writes, right? Nope; the pool wouldn't mount. I had to research the internals of ZFS, create a dummy log device, and rewrite the ID the pool expected. I'm told this has been fixed.


It has been; I had a log device fail and was able to import the pool just fine. I had to do it manually, though, because the automated scripts at boot didn't like it. That's a fine default, since erroring out when something is fishy is a good idea.


That's been fixed for... 10-ish years.

You can now mount a pool without the log device and only lose what was in the log, with just a flag to import.
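For reference, a minimal example of that import (assuming the pool is named tank; -m is the flag that lets the import proceed despite the missing log device):

  zpool import -m tank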


ReiserFS could apparently actually do that, rebuilding the FS by scanning the blocks... and would even merge contained ReiserFS images (e.g. from virtual machines) into the top-level FS.

No personal experience though, but see https://lwn.net/Articles/470517/


copies=n was not intended to be used in place of mirroring - it was really intended for systems where mirroring wasn't an option. While the initial proposal didn't call it out explicitly, those who were "in the know" on the thread described it as a feature to help with failures on laptops. For instance, this message on Sept 12, 2006:

  Dick Davies wrote:
  > The only real use I'd see would be for redundant copies
  > on a single disk, but then why wouldn't I just add a disk?

  Some systems have physical space for only a single drive - think most
  laptops!

  --
  Darren J Moffat
It seems this thread has disappeared from the internet. If anyone is interested in the zfs-discuss@opensolaris.org archives, I can probably convince Gmail to turn it into an mbox and post it somewhere.

Edit: format. Sorry mobile users, I really need a block quote here.



On rotating media, I assume copies=n might improve performance... With multiple copies of a piece of data on the platter, whichever copy is closest to the read head can be read. A full rotation on a 7200 RPM disk takes about 8 milliseconds, so if you can halve the average wait by picking a closer copy of the data, you'll get your data back quicker!

All this depends on the filesystem issuing reads for all the copies, the drive firmware correctly deciding which to read first, and then the OS being able to cancel the reads of the other copies.

I kinda doubt all the above logic is implemented and bug free... Which is sad :-(
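Back-of-the-envelope numbers for that idea, as a Python sketch (it assumes the two copies happen to sit half a revolution apart on the same track, which nothing in ZFS guarantees):

  # Rotational latency on a 7200 RPM drive.
  rpm = 7200
  full_rev_ms = 60_000 / rpm         # ~8.3 ms per revolution
  avg_wait_ms = full_rev_ms / 2      # ~4.2 ms expected wait for a single copy
  # With two copies exactly half a revolution apart, the expected wait for
  # the nearer one drops to about a quarter revolution.
  best_of_two_ms = full_rev_ms / 4   # ~2.1 ms
  print(full_rev_ms, avg_wait_ms, best_of_two_ms)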


This would be a strategy for managing fragmentation. The main ways filesystems manage this are defragmentation and not filling up the disk in the first place. For performance, given a choice of copies=2 or 50% usage, I'd take 50% usage.


This sounds far-fetched, but I was surprised to learn how precise timing can really be. I believe it was a conversation about Bluetooth proximity unlocking. Computer clocks are fast enough to measure the distance to an object at human scales using the speed of light. That still blows my mind.


A minor point: you can't really measure distances by the speed of light via Bluetooth; the protocol is too coarse-grained. Bluetooth distance measurements are usually done via signal strength. However, time-of-flight distance measurements can be done on an extension of Wi-Fi called 802.11v - this is how the Apple Watch proximity unlock works, for example.
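To get a feel for the precision involved, a quick time-of-flight sketch in Python (the numbers are illustrative only, not Apple's actual protocol):

  # Time-of-flight ranging: distance = c * (round-trip time) / 2
  C = 299_792_458                 # speed of light, m/s

  def distance_m(round_trip_ns):
      return C * (round_trip_ns * 1e-9) / 2

  print(distance_m(20))           # a 20 ns round trip is ~3 m away
  # 1 ns of timing error is already ~15 cm of distance error, which is why
  # signal-strength estimates are so much coarser than true time of flight.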


> this is how the Apple Watch proximity unlock works, for example.

I wonder: do you know if the proximity unlock only detects the relative distance between the computer and AP and compares that to the distance between the watch and AP? Or is it between the watch and computer?

The former would make sense to me, because my proximity unlock doesn't always work immediately, and it would make sense if it was because my watch and computer were connected to two different APs in my house (because one or the other hasn't switched to the closest AP yet).


It is the direct time of flight between the computer and watch. However, the way Apple implemented it, the devices both connect to an AP first to mediate the connection; it's similar to how AirDrop connects two computers.


Ah, that may have been it, thanks! The fact it is possible with any consumer electronics at all amazes me.


How do COVID Bluetooth apps work, then?


They're using the coarser, signal-strength-based estimation.


It's the most intuitive explanation of the 2-3 GHz limit that I know of: at those clock rates, the speed of light is about 1 m/s. Factor in the slower speed of electrical signaling in metals and semiconductors, and it is properly measured in cm/s.


A minor quibble: your unit of time should be nanoseconds.


That, or maybe clock ticks


That's what I meant. The speed of light is 300,000,000 m/s. Divide by 3 GHz (3,000,000,000 clocks/s) and you get 10 cm/clock. Wave propagation in an actual conductor is slower still.

Edit window has elapsed though.


So, essentially, the physical die size and c impose a hard constraint on tick propagation? Whoa!


Yup. Only way to go faster (single-core speed) is to shrink smaller, and reduce coordination between cores.


Typical kernels don't consider rotational latency at all because seek latency is so much more important. Besides, the disk layout isn't really available to the OS anyway; what the OS sees is an abstraction managed by the drive firmware.


That’s why, back in the day, some people carefully placed often-accessed data in the middle of the disk.

I think that's a long-lost art, but it still might be done on mainframes (on the really old ones, files were mostly pre-allocated, with individual items stored in a partitioned data set (https://en.wikipedia.org/wiki/Data_set_(IBM_mainframe)#Parti...), so you could have some control over where individual 'files' in a 'directory' got stored. Nowadays, you would have to partition a disk to do that).

I don’t think it has been cost effective to even think of this kind of black magic for decades, though.


Which is almost always a linear mapping starting from the outside of the platter. To be fair, that doesn't include rotational latency, but still, the abstraction isn't as abstract as you make it sound.


How does that tell you how many sectors are in one track, or how many heads there are?

Presenting a drive that consists of sectors in tracks on platters as if it were one very long string of sectors - how much more abstract can you get? And that's ignoring reallocated sectors and shingled-recording shenanigans.


I wouldn't be surprised if they've since switched to using a true spiral. And the head count is one head per disk surface for almost all devices, so two per platter for HDDs.

Yes, reallocated sectors aren't accounted for, but they should be so rare as to not matter outside of hard real-time applications, which shouldn't be using such corner-cutting devices anyway.

The spiral assumption comes from the servo tracks they need, and from the fact that the inter-sector gap only has to be sized for how fast they can switch the write head from idle to spewing bits, so they have an incentive to make it smaller than the distance they'd want the head to be able to jump when continuously streaming data.

Multiple platters would likely just mean that random writes with suitable alignment cost the same until you reach the effective platter count, as the sectors that fly by at the same time should be sequential.

SSDs, on the other hand, use complicated LSM trees or similar data structures.


I highly doubt that logic is implemented.


This was always something that surprised me about ZFS. For a fancy filesystem it largely copied RAID with a trivial layout. I always thought that it would be better to treat the devices as pools of storage and "schedule" writes across them in a much more flexible way. Instead of saying that "these two drives are mirrored" you should be able to say "I want to write two copies of this data" and for example it could pick the two idle drives, or the two emptiest drives. Same with striping, parity and other RAID options.

It seems like the only real advantage "ZFS RAID" has over "RAID + ZFS" is that it can handle the actual write requests separately and it has better options for reading when copies are damaged. But it seems like the layout is just as inflexible as a dumb RAID so we aren't gaining as much as we could by combining the two together.

(My knowledge may be out of date)

copies=n is obviously a step in the right direction but as mentioned it doesn't really provide enough to solve the problem.

It seems to me that the only real downside is that you need to store every copy's location for each block, instead of storing one and assuming that the other locations are the same on the "matching" disks.
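A tiny sketch of the flexible placement described above (Python, with hypothetical names; not how ZFS actually allocates):

  from dataclasses import dataclass, field

  @dataclass
  class Device:
      name: str
      free: int            # free bytes
      cursor: int = 0      # next append offset on this device

  @dataclass
  class BlockPointer:
      # Every copy's location has to be stored explicitly, since there is
      # no fixed "mirror partner" to infer it from.
      locations: list = field(default_factory=list)   # [(device name, offset)]

  def write_block(devices, data, n_copies=2):
      # Pick the n emptiest devices for this block (it could just as well
      # pick the n idlest ones) and record where every copy landed.
      targets = sorted(devices, key=lambda d: d.free, reverse=True)[:n_copies]
      bp = BlockPointer()
      for dev in targets:
          bp.locations.append((dev.name, dev.cursor))
          dev.cursor += len(data)
          dev.free -= len(data)
      return bp

  pool = [Device("sda", 900), Device("sdb", 500), Device("sdc", 100)]
  print(write_block(pool, b"x" * 128).locations)   # copies land on sda and sdb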


First of all, this needs a (2016) tag.

Would an exact duplicate of the existing working drive, maybe done with dd, help with this? Maybe some metadata from the drive layout would have to be changed, too.

Other than this workaround, it seems that ZFS could be changed to allow an import again. Has this been changed in recent years?


There was a post recently about building a home fileserver that mentioned a file system that did this - it was just a layer on top of existing disks and file systems and sorted files by directory (and could send a file to multiple drives if so desired).

I can’t find it now but it was an interesting website.



That was it - I knew it had "perfect" in the name, but man, that search is just all sorts of useless.


mergerfs?


Yep, that was it. And honestly, for home use it may be a much better option than the big boys like ZFS et al., especially if most of your data is backed up on the original media and in millions of stores nationwide (DVD rips, etc.).


This seems like such a trivial observation that I fail to see the significance.

This option is there for on-device recovery, i.e. resistance to bitrot.

A new option involving forward error correction would be even better, though. In-filesystem PAR/CRC, anyone?


> This seems such a trivial observation that I fail to see the significance.

The fact that apparently many people think otherwise makes it significant.


I guess so!




