
Preserving data integrity: A ZFS-inspired storage system - d2wa
https://insanity.industries/post/preserving-data-integrity/
======
magicalhippo
When writing a post titled "Preserving data integrity", I find it weird not to
actually demonstrate that the data integrity is preserved.

I assume md-raid is proven, so removing a disk isn't of primary concern, but at
the very least power down the system and overwrite a few random spots in the
data partition on a couple of the disks. Power the system back up and see how
the pool reacts when reading everything. Verify that the data is correct.
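
Something along these lines would do (a sketch; device names, offsets and the
checksum file are made up):

    # with the array stopped, flip a few bytes on two member disks
    dd if=/dev/urandom of=/dev/sdb2 bs=512 seek=123456 count=8 conv=notrunc
    dd if=/dev/urandom of=/dev/sdc2 bs=512 seek=654321 count=8 conv=notrunc
    # reassemble, read everything back, compare against known checksums
    md5sum -c checksums-taken-beforehand.md5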

I mean, write and read performance means little without actual integrity, or
if you need to perform some arcane incantations to get the system running
smoothly again.

------
kroeckx
This does not provide the same integrity as ZFS. Dm-integrity only protects
against corruption on the disk itself, while ZFS attempts to protect against
all sources of corruption, including software/firmware bugs and errors in RAM
and the I/O path.

ZFS uses a Merkle tree, so it knows which checksum to expect before reading
the data. With raidz/mirror, when it detects corruption, it will attempt to
read from the other drive(s) and find a combination of disks that yields the
correct checksum. If it can't find a combination that works, only whatever
depends on that block becomes unavailable. If it can recover the block, it
will attempt to repair the corrupted disk(s) by writing the correct data back.
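
On a ZFS pool this self-healing shows up in zpool status after a scrub;
illustrative commands and output, not from a real pool:

    zpool scrub tank
    zpool status tank
    # illustrative output fragment:
    #   scan: scrub repaired 256K in 0h42m with 0 errors
    #     NAME        STATE     READ WRITE CKSUM
    #     tank        ONLINE       0     0     0
    #       mirror-0  ONLINE       0     0     0
    #         sda     ONLINE       0     0     2   <- bad copies found, rewritten
    #         sdb     ONLINE       0     0     0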

Standard mdraid cannot repair an array in case of corruption, since it doesn't
know which of the data it has is correct; you need to figure out which of the
disks has corrupt data on it, manually remove that drive from the array, and
hope none of the other disks is corrupted. In theory, in case of a read error
mdraid could attempt to fix it, but it doesn't and just removes the drive from
the array. Dm-integrity provides more guarantees that the data you get is
correct, but since mdraid and dm-integrity are different layers, they don't
actually work together. There is no attempt to repair in case of corruption;
dm-integrity will just return a read error and mdraid will remove the drive
from the array.

~~~
cmurf
Whether it's an uncorrectable read error (bad sector) reported by drive
firmware or dm-integrity detecting corruption, the affected LBAs are
propagated up to md. md can then determine the location of a copy (mirror, or
reconstruct from parity) and overwrite the bad location.
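
On md this can also be triggered by hand; a sketch, assuming the array is md0:

    # recompute all copies/parity and rewrite inconsistent blocks
    echo repair > /sys/block/md0/md/sync_action
    # sectors found inconsistent during the last check/repair pass
    cat /sys/block/md0/md/mismatch_cnt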

This mechanism is often thwarted with consumer drives, when their SCT ERC
timeout can't be set or is longer than the kernel's default SCSI command timer
of 30 seconds. Once a command hasn't returned a result of some kind within
30s, the SCSI driver does a link reset. On SATA this has the pernicious effect
of clearing the entire command queue, not just the one command that was hung
up in "deep recovery". No LBAs are returned, so it's indeterminate where the
problem was or what caused it, and no fix-up happens. This results in
bad-sector accumulation. This misconfiguration is common, and it routinely
costs people their data.

[https://raid.wiki.kernel.org/index.php/Timeout_Mismatch](https://raid.wiki.kernel.org/index.php/Timeout_Mismatch)

This may be easier and more reliable to do with a udev rule, but the concept
is the same. Also, while this is the linux-raid wiki, it doesn't only apply to
mdadm raid but to LVM and Btrfs as well. I don't know if it applies to ZoL,
because I don't know all the layers ZFS implements itself separately from
Linux. But if it depends at all on the SCSI driver for error handling, it
would be at risk of this misconfiguration as well.
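
A udev rule in the spirit of that wiki page could look roughly like this (a
sketch only; the matching and the smartctl path will need adjusting):

    # /etc/udev/rules.d/60-scterc.rules -- illustrative
    # set a 7s SCT ERC timeout (unit: 100ms) on every SATA disk at add time
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"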

~~~
kroeckx
It seems that the HGST/WD Ultrastar, which is their data center drive line,
has ERC disabled by default. I've now set it to the suggested 7 seconds.
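
For reference, checking and setting it with smartctl (device name is a
placeholder):

    smartctl -l scterc /dev/sdX          # show current SCT ERC timeouts
    smartctl -l scterc,70,70 /dev/sdX    # set read/write ERC to 7s (unit: 100ms)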

------
fmajid
I understand the licensing issues, but having a system with 4 components
(ext4, dm-crypt, mdraid, dm-integrity) instead of a single integrated one
(ZFS) can hardly be said to be simpler. On a distribution like Ubuntu, adding
and maintaining ZFS is completely painless.

~~~
throw0101a
> _I understand the licensing issues, but having a system with 4 components_
> […]

I am reminded of ZFS co-creator Jeff Bonwick's "Rampant Layering Violation?"
(then-Sun) weblog post:

* [https://web.archive.org/web/20070508214221/http://blogs.sun....](https://web.archive.org/web/20070508214221/http://blogs.sun.com/bonwick/entry/rampant_layering_violation)

* [https://blogs.oracle.com/bonwick/rampant-layering-violation](https://blogs.oracle.com/bonwick/rampant-layering-violation)

~~~
nix23
Sometimes a monolith is "better" than the "Unix way": think of kernels,
network stacks, filesystems (Btrfs, ZFS), and databases.

~~~
throw0101a
Well, ZFS isn't _exactly_ monolithic if you look under the hood: it has the
ZPL (files, directories), the DMU (objects, transactions on those objects),
and the SPA (actual disk I/O).

A potato-quality video from 2008 with Moore and Bonwick, the creators
(timestamped to relevant section):

* [https://www.youtube.com/watch?v=NRoUC9P1PmA&t=14m19s](https://www.youtube.com/watch?v=NRoUC9P1PmA&t=14m19s)

~~~
magicalhippo
Possibly less potato-quality video about the same:

[https://youtu.be/MsY-BafQgj4?t=442](https://youtu.be/MsY-BafQgj4?t=442)
(OpenZFS Basics by Matt Ahrens and George Wilson)

------
lorenzfx
> [...] whereas ZFS, the most prominent candidate for this type of features,
> is not without hassle and it must be recompiled for every kernel update
> (although automation exists).

I have the feeling that this setup is even more of a hassle than ZFS.

~~~
nix23
>I have the feeling that this setup is even more of a hassle than ZFS.

Hehe so true!

ZFS:

> dd if=/dev/zero of=/path/to/testfile bs=1M count=18000 conv=fdatasync

That's stupid: writing zeros to ZFS just tests the compression speed, and your
file will take up barely more than zero bytes on disk.
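
Easy to verify; a sketch, assuming a compressed dataset mounted at /tank:

    dd if=/dev/zero of=/tank/testfile bs=1M count=1024 conv=fdatasync
    ls -lh /tank/testfile   # apparent size: 1.0G
    du -h /tank/testfile    # on-disk size: next to nothing, zeros compress away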

~~~
hikarudo
Unless compression is disabled.

~~~
Filligree
Which, for a variety of reasons, you should never do on a modern system. Use
at least compression=zle.

~~~
nix23
For movies/pictures and sound (well everything that is already compressed) you
can deactivate it, zle too.

EDIT: On the other hand leave it on, zfs tests the file if it's compress-
able..if not it makes nothing:

[https://klarasystems.com/articles/openzfs1-understanding-
tra...](https://klarasystems.com/articles/openzfs1-understanding-transparent-
compression/)

>>ZFS Compression, Incompressible Data and Performance

>>You may be tempted to set “compression=off” on datasets which primarily have
incompressible data on them, such as folders full of video or audio files. We
generally recommend against this – for one thing, ZFS is smart enough not to
keep trying to compress incompressible data, and to never store data
compressed, if doing so wouldn’t save any on-disk blocks.
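
Compression is a per-dataset property, so it's easy to check what it actually
buys you; dataset name here is hypothetical:

    zfs get compression,compressratio tank/media
    # lz4 is cheap enough to just leave on:
    zfs set compression=lz4 tank/media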

------
magicalhippo
> As ZFS did not adhere to conv=fdatasync, the main memory was restricted to
> 1GB [...]

I guess this might explain the latency tail for ZFS. As far as I understand,
ZFS relies heavily on its caches for performance, both by design and
implementation choices.

I'm also a bit puzzled by the read performance.

Would be nice to see a re-run of the benchmarks with 2 or 4 GB of memory, to
see the effect of the cache.
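
Short of re-running with less physical RAM, capping the ARC would approximate
this; a sketch for ZFS on Linux (zfs_arc_max is in bytes):

    # limit the ARC to 2 GiB at runtime
    echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
    # or persistently via a module option
    echo "options zfs zfs_arc_max=2147483648" > /etc/modprobe.d/zfs.conf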

~~~
fomine3
That makes the benchmark unrealistic. I suspect virtually every real ZFS setup
uses at least 2GB of RAM.

------
curt15
As I understand it, dm-integrity is supposed to graft ZFS-like checksumming
onto any filesystem. How good is it at detecting corruption and fixing it in a
RAID setup?

~~~
magicalhippo
It's not quite the same: dm-integrity[1] only does per-sector checksums, while
the checksums in ZFS form a Merkle tree[2].

[1]: [https://www.kernel.org/doc/html/latest/admin-guide/device-ma...](https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/dm-integrity.html)

[2]:
[https://en.wikipedia.org/wiki/ZFS#Data_integrity](https://en.wikipedia.org/wiki/ZFS#Data_integrity)

------
paulmd
> One thing to note is that checks and resyncs of the proposed setup are
> significantly prolonged (by a factor of 3 to 4) compared to an mdraid
> without the integrity-layer underneath. The investigation has so far
> not revealed the underlying cause. It is not CPU-bound, indicating that the
> read-performance is not held back by checksumming, the latency figures above
> do not imply an increase in latency significant enough to cause a factor 3-4
> prolonged check-time, disabling the journal did not change this either (as
> one would expect, as the journal is unrelated to read, which should be the
> only relevant mode for a raid-check).

> ZFS was roughly on par with mdraid without the integrity-layer underneath
> with regard to raid-check time.

> If anyone has an idea of the root cause of this behavior, feel encouraged to
> contact me, I’d be intrigued to know.

Note that by default ZFS is tuned to prioritize online transactions at the
expense of scrub/resilver speed, as this is a sensible choice for its intended
use-case of business storage appliances. Businesses can't stop the world for a
resilver.

For your average home NAS, you can tune ZFS to scrub/resilver faster; it's not
really a big deal, as you're not really hammering it anyway.

[https://www.reddit.com/r/zfs/comments/6t799g/really_slow_scr...](https://www.reddit.com/r/zfs/comments/6t799g/really_slow_scrub/)

[https://www.ixsystems.com/community/threads/scrub-performanc...](https://www.ixsystems.com/community/threads/scrub-performance-tuning.51959/)
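
The knobs live in /sys/module/zfs/parameters, and their names vary between
OpenZFS versions; a sketch of the kind of tuning meant here, values
illustrative:

    # allow more concurrent scrub I/Os per vdev
    echo 8 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
    # raise the per-vdev scan bandwidth cap (bytes)
    echo 134217728 > /sys/module/zfs/parameters/zfs_scan_vdev_limit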

------
fomine3
Don't use dd for benchmarking; please use fio.
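
For example, something roughly like this instead of the article's dd run
(parameters and target directory are illustrative):

    # sequential 1M writes with a final fsync, closest to the dd test
    fio --name=seqwrite --directory=/mnt/pool --rw=write \
        --bs=1M --size=4g --ioengine=psync --end_fsync=1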

------
deegles
It's funny that when using cloud storage (like S3) the probability of me
losing my data due to a billing issue or user error is way higher than any
underlying technical risk. It's like being afraid of dying in a plane crash
but seeing no issue riding a motorcycle to the airport.

------
piercebot
I've had a lot of good luck with ZFS over the past decade and a half. I just
started building my third file server last week, actually. Going to try it
with NVMe drives this time!

[https://ajpierce.com/2020-09-02_file-server-pt1/](https://ajpierce.com/2020-09-02_file-server-pt1/)

------
sigstoat
how is the on-disk data repaired once an error is found?

if dm-integrity returns an error to mdraid, will mdraid rewrite the bad blocks
once it determines the correct value from a mirrored copy, or via parity?

~~~
magicalhippo
From the man page[1]:

"In kernels prior to about 2.6.15, a read error would cause the same effect as
a write error. In later kernels, a read-error will instead cause md to attempt
a recovery by overwriting the bad block. i.e. it will find the correct data
from elsewhere, write it over the block that failed, and then try to read it
back again. If either the write or the re-read fail, md will treat the error
the same way that a write error is treated, and will fail the whole device."

[1]: [https://linux.die.net/man/4/md](https://linux.die.net/man/4/md)

