
I'm thankful for their OpenZFS tuning doc which they developed as part of this server migration: https://github.com/letsencrypt/openzfs-nvme-databases

The one thing that I get hung up on when it comes to RAID and SSDs is the wear pattern vs. HDDs. Take for example this quote from the README.md:

We use RAID-1+0, in order to achieve the best possible performance without being vulnerable to a single-drive failure.

Failure on SSDs is predictable and usually expressed with Terabytes Written (TBW). Failure on spinning disk HDDs is comparatively random. In my mind, it makes sense to mirror SSD-based vdevs only for performance reasons and not for data integrity. The reason is that the mirrors are expected to fail after the same amount of TBW, and thus the availability/redundancy guarantee of mirroring is relatively unreliable.

Maybe someone with more experience in this area can change my mind, but if it were up to me, I would have configured the mirror drives as spares, and relied on a local HDD-based zpool for quick backup/restore capability. I imagine that would be a better solution, although it probably wouldn't have fit into tryingq's ideal 2U space.

> Failure on SSDs is predictable and usually expressed with Terabytes Written (TBW). Failure on spinning disk HDDs is comparatively random.

That wasn't my experience with thousands of SSDs and spinning drives. Spinning drives failed more often, but usually with SMART sector counts increasing beforehand. Our SSDs never got close to media wearout, but that didn't stop them from dropping off the bus: working fine one moment, then boom, undetectable, with all data gone.

Then there are the incidents where the power-on hours value rolls over and kills the firmware. I believe these have happened on disks of all types, but here's a recent one on SSDs [1]. Normally when building a big server, all the disks are installed and powered on at the same time, which risks catastrophic failure in case of a firmware bug like this. If you can, try to get drives from different batches, and stagger the power-on times.

[1] https://www.zdnet.com/article/hpe-says-firmware-bug-will-bri...

> Failure on spinning disk HDDs is comparatively random.

Comparatively, yes, but when averaged over a large number of hard drives, failures definitely tend to follow the typical bathtub curve seen in any mechanical product with moving parts:

* Early failures are HDDs that die within a few months of being put into service.

* In the middle of the curve, there is a steady rate of random failures.

* Towards the end of the drives' lifespan, after they've been spinning and seeking for many years, failures increase.
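For anyone who wants to picture the bathtub shape, here is a minimal sketch that models it as the sum of three Weibull hazard rates: one decreasing (early failures), one constant (random failures), one increasing (wear-out). The shape and scale parameters are made up purely for illustration and are not fitted to any real drive population.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years):
    """Sum of three Weibull hazards. All parameters are illustrative assumptions."""
    infant = weibull_hazard(t_years, shape=0.5, scale=20.0)   # shape < 1: rate falls over time
    steady = weibull_hazard(t_years, shape=1.0, scale=50.0)   # shape = 1: constant rate
    wearout = weibull_hazard(t_years, shape=5.0, scale=8.0)   # shape > 1: rate rises over time
    return infant + steady + wearout


for t in (0.1, 1.0, 3.0, 6.0, 9.0):
    print(f"year {t:>4}: hazard {bathtub_hazard(t):.3f} failures per drive-year")
```

With these parameters the rate is high in the first months, bottoms out around year three, and climbs again late in life.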

SSDs still fail, just not often.

State-of-the-art systems keep ~1.2 copies (e.g. 10+2 RAID 6) on SSD, plus an offsite backup or two. The bandwidth required for timely rebuilds is usually the bottleneck.
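To spell out the "~1.2 copies" arithmetic and why rebuild bandwidth matters, here is a rough sketch. The drive size and rebuild throughput are hypothetical numbers chosen for illustration, and the rebuild estimate ignores everything except raw sequential throughput.

```python
def copies_stored(data_drives, parity_drives):
    """Raw bytes stored per byte of user data for a data+parity layout."""
    return (data_drives + parity_drives) / data_drives


def rebuild_hours(drive_tb, rebuild_mb_per_s):
    """Lower bound on the time to rewrite one replacement drive end to end."""
    drive_bytes = drive_tb * 1e12
    bytes_per_second = rebuild_mb_per_s * 1e6
    return drive_bytes / bytes_per_second / 3600


print(copies_stored(10, 2))   # 10+2 RAID 6 -> 1.2
print(copies_stored(1, 1))    # two-way mirror -> 2.0

# Hypothetical: an 8 TB drive rebuilt at a sustained 500 MB/s.
print(f"{rebuild_hours(8, 500):.1f} hours")
```

The wider the stripe, the lower the overhead, but every rebuild has to read from all surviving members, which is where the bandwidth pressure comes from.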

These systems can be ridiculously dense; a few petabytes easily fit in 10U. With that many NAND packages, drive failures are common.

An early mitigation strategy has been to use different brands of SSD so that failures are more spread out. For RAID 6 we started out with at most 2 drives of a single brand.

The result for us was 2 drives of the same brand failing within the same month; since then it seems to be single failures rather than clusters.

The tuning is almost identical to what we have in production. A few comments:

* I would probably go with ashift=12 to get a better compression ratio, or even as far as ashift=9 if the disks can sustain the same performance. Benchmark first, of course.

* We came to the same conclusion regarding AIO recently, but just today I ran more benchmarks, and it looks like the ZFS shim does perform better than the InnoDB shim. So I think it's still fine to enable innodb_use_native_aio.

* We use zstd compression from ZFS 2.0. It's great, and we all deserve it after suffering through the PR dramas.
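On the ashift point above: the reason a smaller ashift can help the compression ratio is that a compressed block still occupies a whole number of 2^ashift-byte sectors on disk. A simplified sketch follows, assuming a 16 KiB record that compresses to 5000 bytes (both numbers are hypothetical). It ignores metadata, raidz padding, and ZFS's rule for storing a block uncompressed when compression saves too little.

```python
def allocated_bytes(compressed_size, ashift):
    """Round a compressed block up to a whole number of 2**ashift-byte sectors."""
    sector = 1 << ashift
    sectors_needed = -(-compressed_size // sector)  # ceiling division
    return sectors_needed * sector


record_size = 16 * 1024   # assumed logical record size
compressed_size = 5000    # assumed size after compression

for ashift in (9, 12, 13):
    on_disk = allocated_bytes(compressed_size, ashift)
    ratio = record_size / on_disk
    print(f"ashift={ashift:>2} sector={1 << ashift:>5} on-disk={on_disk:>5} ratio={ratio:.2f}x")
```

In this model the same block takes 5120 bytes at ashift=9 but 8192 bytes at ashift=12 or ashift=13, so the effective ratio drops from 3.2x to 2x. Whether the drives perform as well with smaller sectors is a separate question, hence "benchmark first".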

> The reason is that the mirrors are expected to fail after the same amount of TBW

You could fix that by writing a bit more to one of the disks, e.g. run badblocks for different amounts of time before putting them in service.
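As a back-of-the-envelope sketch of how much separation that buys, assuming wear-out happens exactly at the rated TBW (which, as noted elsewhere in the thread, is often not how SSDs actually die) and that both mirror members see the same daily writes. The TBW rating, write rate, and pre-written amount below are hypothetical.

```python
def days_until_wearout(rated_tbw, already_written_tb, daily_writes_tb):
    """Days until a drive reaches its rated TBW at a constant write rate."""
    return (rated_tbw - already_written_tb) / daily_writes_tb


rated_tbw = 1000.0       # hypothetical endurance rating, in TB written
daily_writes_tb = 0.5    # hypothetical steady write load per mirror member

fresh_drive = days_until_wearout(rated_tbw, 0.0, daily_writes_tb)
pre_worn_drive = days_until_wearout(rated_tbw, 10.0, daily_writes_tb)  # 10 TB written up front

print(f"fresh drive:    {fresh_drive:.0f} days")
print(f"pre-worn drive: {pre_worn_drive:.0f} days")
print(f"gap:            {fresh_drive - pre_worn_drive:.0f} days")
```

The gap is simply the pre-written amount divided by the daily write rate, so the offset needs to be sized against how long a replacement and resilver would realistically take.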
