Hey HN,
Since I'm personally affected by this bug [1] and it scared me somewhat, I thought I'd post my research results to start a discussion about the details.
I must say I've lost a bit of trust in ZFS, since it was 'sold' as a rock-solid, unbreakable file system, and now this... the sheer complexity of ZFS makes me suspect this is not the only bug of its kind sitting undetected in the source code.
However, the communication was clear and precise, other filesystems have had similar problems, and I just love the feature set of ZFS. So I'll give it another shot, hoping that future releases will be safe(r).
Here are my conclusions so far:
1. ZFS had a silent data corruption bug recently discussed on HN [6][7]
2. `Silent` means you can't do anything about it without upgrading ZFS (scrubs, checks, etc. don't help)
3. There is no tool to check if your data has been corrupted
4. Data written once and untouched since (e.g. backups) should be relatively safe; the bug strikes when a file is read back (via hole detection) while it is still being written [4]
5. Setting `zfs_dmu_offset_next_sync=0` reduces the probability of being affected, but is no guarantee [4] (see the sketch after this list for checking the current value)
6. The bug affected every version before 2.2.2 and 2.1.14 [2], but hitting it became much more likely between 2.1.4 and 2.2.1 because of a behaviour change in coreutils (newer `cp` uses SEEK_DATA/SEEK_HOLE to detect holes, as illustrated in the copy-loop sketch further down)
7. The underlying problem has existed much longer (since 2006?) - this is still a question mark [4]
8. Run `zfs version` to see the version number (the sketch after this list reads the same info from sysfs)
9. If you're scared (I was), please read this comment: https://news.ycombinator.com/item?id=38553342
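Points 5 and 8 can be checked programmatically on Linux. Here's a minimal C sketch, assuming the OpenZFS kernel module exposes its version and parameters under /sys/module/zfs (it does on my machines); `zfs version` or a plain `cat` of those files gives the same answer:

```c
/* Read the loaded ZFS module version and the zfs_dmu_offset_next_sync
 * tunable from sysfs (Linux only; paths assume the OpenZFS kernel module). */
#include <stdio.h>

static void print_file(const char *label, const char *path)
{
    char buf[128];
    FILE *f = fopen(path, "r");
    if (!f) {
        printf("%s: <not available>\n", label);
        return;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("%s: %s", label, buf);   /* sysfs values end with '\n' */
    fclose(f);
}

int main(void)
{
    print_file("module version", "/sys/module/zfs/version");
    print_file("zfs_dmu_offset_next_sync",
               "/sys/module/zfs/parameters/zfs_dmu_offset_next_sync");
    return 0;
}
```

Turning the tunable off is a root-only write of `0` into that same parameters file (or a modprobe option); as said in point 5, it only lowers the odds, it doesn't close the race.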
So the only way to ENSURE your data is OK is to checksum all files at file level against copies on another filesystem and compare, or to re-transfer all data from another filesystem after upgrading ZFS to the latest version.
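To make the comparison idea concrete, here's a minimal sketch in C. It assumes you still have a pristine copy of each file on another filesystem and compares byte for byte; with a second copy at hand you don't even need checksums, and walking a whole tree is left out for brevity:

```c
/* Compare two files byte for byte; exit 0 if identical, 1 if they differ.
 * A stand-in for the "checksum and compare" step. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s FILE_ON_ZFS FILE_ON_OTHER_FS\n", argv[0]);
        return 2;
    }
    FILE *a = fopen(argv[1], "rb"), *b = fopen(argv[2], "rb");
    if (!a || !b) { perror("fopen"); return 2; }

    char ba[65536], bb[65536];
    size_t na, nb;
    do {
        na = fread(ba, 1, sizeof(ba), a);
        nb = fread(bb, 1, sizeof(bb), b);
        /* Different lengths or different content both count as corruption. */
        if (na != nb || memcmp(ba, bb, na) != 0) {
            puts("DIFFER");
            return 1;
        }
    } while (na == sizeof(ba));

    puts("identical");
    return 0;
}
```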
Please let me know if I missed something or got something wrong.
Sources:
[1]: https://www.phoronix.com/news/OpenZFS-Data-Corruption-Battle
[2]: https://www.phoronix.com/news/OpenZFS-2.2.2-Released
[3]: https://www.theregister.com/2023/12/04/two_new_versions_of_o...
[4]: https://www.reddit.com/r/zfs/comments/1826lgs/psa_its_not_bl...
[5]: https://news.ycombinator.com/item?id=38519382
[6]: https://news.ycombinator.com/item?id=38405731
[7]: https://news.ycombinator.com/item?id=38380240
Here's a post by RobN (the dev who wrote the fix) on the ZFS on Linux mailing list:
> There's a really important subtlety that a lot of people are missing in this. The bug is _not_ in reads. If you read data, its there. The bug is that sometimes, asking the filesystem "is there data here?" it says "no" when it should say "yes". This distinction is important, because the vast majority of programs do not ask this - they just read.
> Further, the answer only comes back "no" when it should be "yes" if there has been a write on that part of the file, where there was no data before (so overwriting data will not trip it), at the same moment from another thread, and at a time where the file is being synced out already, which means it had a change in the previous transaction and in this one.
> And then, the gap you have to hit is in the tens of machine instructions.
> This makes it very hard to suggest an actual probability, because this is a sequence and timing of events that basically doesn't happen in real workloads, save for certain kinds of parallel build systems, which combine generated object files into a larger compiled program in very short amounts of time.
> And even _then_, all this supposes that you do all this stuff, and don't then use the destination file, because if you did, you would have noticed that its incomplete.
> So while I would never say that no one has ever hit the problem unknowingly, I feel pretty confident that they haven't. And if you're not sure, ask yourself if you've ever had highly parallel workloads that involve writing and seeking the same files at the same moment.
https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tcf27ae8f...
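To make RobN's "is there data here?" concrete: that question is the lseek(2) SEEK_DATA/SEEK_HOLE interface, which newer coreutils `cp` uses to skip holes in sparse files. Below is my own simplified illustration of such a copy loop, not the actual coreutils code. If SEEK_DATA falsely answers "no data" during the race he describes, the loop skips a region that really holds data, and the destination silently ends up with zeros there:

```c
/* Sparse-aware copy loop in the style of coreutils cp (simplified).
 * The bug: during the race RobN describes, SEEK_DATA could report
 * "no data here" for a region that actually holds data, so this loop
 * would skip it and leave a hole (zeros) in the destination. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s SRC DST\n", argv[0]); return 2; }
    int in  = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 2; }

    off_t end = lseek(in, 0, SEEK_END);
    off_t off = 0;
    char buf[65536];

    while (off < end) {
        /* "Is there data here?" - the call the bug gave wrong answers to. */
        off_t data = lseek(in, off, SEEK_DATA);
        if (data < 0)                 /* ENXIO: no more data before EOF */
            break;
        off_t hole = lseek(in, data, SEEK_HOLE);

        /* Copy the data extent [data, hole); just seek across holes. */
        lseek(in, data, SEEK_SET);
        lseek(out, data, SEEK_SET);
        for (off_t pos = data; pos < hole; ) {
            ssize_t want = (hole - pos) < (off_t)sizeof(buf)
                         ? (ssize_t)(hole - pos) : (ssize_t)sizeof(buf);
            ssize_t n = read(in, buf, (size_t)want);
            if (n <= 0) { perror("read"); return 1; }
            if (write(out, buf, (size_t)n) != n) { perror("write"); return 1; }
            pos += n;
        }
        off = hole;
    }
    ftruncate(out, end);              /* preserve length / trailing hole */
    return 0;
}
```

Note that the source file is fine the whole time; only the copy made through this path is damaged. That's also why scrubs and checksums come back clean (points 2 and 3 above).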
Here's another writeup by ZFS dev Rincebrain: https://gist.github.com/rincebrain/e23b4a39aba3fadc04db18574...
I think the only reason this has gotten so much attention is that it first surfaced as a block cloning bug (which it isn't), and block cloning being a brand-new feature created a massive scare that the corruption was widespread. This isn't the first or the last bug ZFS has had - it's software.