24 drives. Same model. Likely the same batch. Similar wear. Imagine most of them failing at the same time, and the rest failing as you're rebuilding it due to the increased load, because they're already almost at the same point.
I ordered my NAS drives on Amazon. To avoid getting the same batch (all consecutive serial numbers), I used amazon.co.uk for one half and amazon.de for the other half. One could also stage the orders in time.
Software bugs might cause that (e.g. a drive fails after exactly 1 billion IOPS due to some counter overflowing). But hardware wear probably won't be as consistent.
I've seen this happen to a friend. Back in the noughties they built a home NAS similar to the one in the article, using fewer (smaller) drives. It was in RAID5 configuration. It lasted until one drive died and a second followed it during the rebuild. Granted, it wasn't using ZFS, there was no regular scrubbing, 00s drive failure rates were probably different, and they didn't power it down when not using it. The point is the correlated failure, not the precise cause.
Usual disclaimers, n=1, rando on the internet, etc.
You’re far better off having two RAIDs, one as a daily backup of progressive snapshots that only turns on occasionally to back up and is off the rest of the time.
I don’t understand how it is better to have an occasional (= significantly time-delayed) backup. You’ll lose all changes since the last backup. And you’re doubling the cost, compared to just one extra hard drive for RAID 6.
Really important stuff is already being backed up to a second location anyway.
> This NAS is very quiet for a NAS (video with audio).
Big (large radius) fans can move a lot of air even at low RPM. And be much more energy efficient.
Oxide Computer, in one of their presentations, talks about using 80mm fans, as they are quiet and (more importantly) don't use much power. They observed that in other servers as much as 25% of the power went just to powering the fans, versus ~1% in theirs:
Interesting - I'm used to desktop/workstation hardware where 80mm is the smallest standard fan (aside from 40mm's in the near-extinct Flex ATX PSU), and even that is kind of rare. Mostly you see 120mm or 140mm.
Yeah. In a home environment you should absolutely use desktop gear. I have five 80mm and one 120mm PWM fan in my NAS and they are essentially silent, as they can't be heard over the sound of the drives (which is essentially the noise floor for a NAS).
It is necessary to use good PWM fans though if concerned about noise as cheaper ones can "tick" annoyingly. Two brands I know to be good in this respect are Be Quiet! and Noctua. DC would in theory be better but most motherboards don't support it (would require an external controller and thermal sensors I think).
> Those 40mm PSU fans, and the PSU, are what they are replacing with a DC bus bar.
DC (power) in the DC (building) isn't anything new: the telco space has used -48V (nominal) power for decades. Do a search for (say) "NEBS DC power" and you'll get a bunch of stuff on the topic.
Lots of chassis-based systems centralized the AC-DC power supplies.
We also worked with the fan vendor to get parts with a lower minimum RPM. The stock fans idle at about 5K RPM, and ours idle at 2K, which is already enough to keep the system cool under light loads.
> just curious, are you associated with them, as these are very obscure youtube videos :D
Unassociated, but tech-y videos are often recommended to me, and these videos got pushed to me. (I have viewed other, unrelated Tech Day videos, which is probably why I got that short. I'm also an old Solaris admin, so I'm aware of Cantrill, especially his rants.)
> Love it though, even the reduction in fan noise is amazing. I wonder why nobody had thought of it before, it seems so simple.
Depends on the size of the server: can't really expand fans with 1U or even 2U pizza boxes. And for general purpose servers, I'm not sure how many 4U+ systems are purchased—perhaps some more now that GPU cards may be a popular add-on.
For a while chassis systems (e.g., HP c7000) were popular, but I'm not sure how they are nowadays.
> I'm not sure how many 4U+ systems are purchased—perhaps some more now that GPU cards may be a popular add-on.
Going from what I see at eCycle places, 4U dried up years ago. Everything is either 1 or 2U or massive blade receptacles (10+ U).
We (the home-lab on a budget people) may see a return to 4U now that GPUs are in vogue, but I'd bet that the hyperscalers are going to drive that back down to something that'll be 3U with water cooling or so over the longer term.
We may also see similar with storage systems too; it's only a matter of time before SSD gets "close enough" to spinning rust on the $/gig/unit-volume metrics.
I’ve heard the exact opposite advice (keep the drives running to reduce wear from power cycling).
Not sure what to believe, but I like having my ZFS NAS running so it can regularly run scrubs and check the data. FWIW, I’ve run my 4 drive system for 10 years with 2 drive failures in that time, but they were not enterprise grade drives (WD Green).
I think a lot of the advice around keeping the drives running is about avoiding wear caused by spin downs and startups i.e. keeping the "Start Stop Cycles" low.
There's a difference between spinning a drive up/down once or twice a day and spinning it down every 15 minutes or less.
Also, WD Green drives are not recommended for NAS usage. I know in the past they used to park the read/write head every few seconds or so, which is fine if data is being accessed infrequently, but on a server where data is accessed continuously this can result in continuous wear, which leads to premature failure.
Agree. I do weekly backups; the backup NAS is only switched on and off 52 times a year. After 5-6 years the disks are probably close to new in terms of usage vs disks that have been running continuously over that same period.
Which leads to another strategy: swap the primary and the backup after 5 years to get a good 10 years out of the two NASes.
I think I know what you are talking about. I don't know if it made the green drive identical to a red, but it turned on TLER so the green drives wouldn't constantly drop out of the RAID array.
I’m running a 3-disk ZFS mirror of 8-10 year old Greens and have yet to have any issues.
* The read/write heads experience literally next to no wear while they are floating above the platters. They physically land onto shelves or onto landing zones on the platters themselves when turned off; landing and takeoff are by far the most wear the heads will suffer.
* Following on the above, in the worst case the read/write heads might be torn off during takeoff due to stiction.
* Bearings will last longer; they might also seize up if left stationary for too long. Likewise the drive motor.
* The rush of current when turning on is an electrical stressor, no matter how minimal.
The only reasons to turn your hard drives off are to save power, reduce noise, or transport them.
Counterpoints for each:
Heads don't suffer wear when parking. The armature does.
If the platters are not spinning fast enough, or the air density is too low: the heads will crash into the sides of the platters.
The main wear on platter bearings is vibration; it takes an extremely long time for the lube to "gum up," if it's still a thing at all. I suspect it used to happen because they were petroleum distillate lubes, so shorter chains would evaporate/sublimate, leaving longer, more viscous chains. Or straight polymerize.
With fully synthetic PAO oils, and other options they won't do that anymore.
What inrush? They're polyphase steppers. The only reason for inrush is that the engineers didn't think it'd affect lifetime.
Counter: turn your drives off. The saved power of 8 drives being off half the day easily totals $80 a year, enough to replace all but the highest capacities.
> Keep them running [...] Bearings will last longer; they might also seize up if left stationary for too long. Likewise the drive motor.
All HDD failures I've ever seen in person (5 across 3 decades) were bearing failures, in machines that were almost always on with drives spun up. It's difficult to know for sure without proper A-B comparisons, but I've never seen a bearing failure in a machine where drives were spun down automatically.
It also seems intuitive that for mechanical bearings the longer they are spun up the greater the wear and the greater the chance of failure.
I think I have lost half a dozen hard drives (and a couple of DVD-RW drives) over the decades because they sat in a box on a shelf for a couple of years. (I recall that one recovered and worked with a higher-amperage 12V supply, but only long enough to copy off most of the data.)
> The only reasons to turn your hard drives off are to save power, reduce noise, or transport them.
One reason some of my drives get powered down 99+% of the time is that it's a way to guard against the whole network getting cryptolockered. I have a weekly backup run by a script that powers up a pair of RAID1 USB drives, does an incremental no-delete backup, then unmounts and powers them back down again. Even in a busy week they're rarely running for more than an hour or two. I'd have to get unlucky enough for the powerup script not to detect being cryptolockered (it checks md5 hashes of a few "canary files") and to power up the backup drives anyway. I figure that's a worthwhile reason to spin them down weekly...
No, the only proper way to prevent attacks on the data thereof is to keep a backup that isn't readily accessible. Aka offline, whether that's literally turned off or just airgapped from the rest of the infrastructure.
How will they both encrypt your machine's drive and the NAS backup with snapshots on it, yet not defeat the air gap? Air gapping is not some magical technique that keeps you secure.
If you really want to protect yourself from cryptolocker attacks (which seems odd for an individual to be concerned about; nobody extorts individuals for a few hundred dollars to get the keys, they extort businesses for many thousands of dollars), then you need write-once media like M-DISC or WORM tapes.
Yes - although it's worth bearing in mind the number of load/unload cycles a drive is rated for over its lifetime.
In the case of the IronWolf NAS drives in my home server, that's 600,000.
I spin the drives down after 20 minutes of no activity, which I feel is a good balance between having them be too thrashy and saving energy. After 3 years I'm at about 60,000 load/unload cycles.
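For a rough sense of scale, here is a back-of-the-envelope sketch in Python using the figures above (600,000 rated cycles, ~60,000 accumulated over 3 years); the numbers are the comment's, not a general rule:

```python
# Rough load/unload budget check using the figures quoted above.
rated_cycles = 600_000      # IronWolf load/unload rating
observed_cycles = 60_000    # accumulated so far
observed_years = 3

cycles_per_year = observed_cycles / observed_years    # ~20,000 per year
years_to_exhaust = rated_cycles / cycles_per_year     # ~30 years at this rate

print(f"~{cycles_per_year:,.0f} cycles/year; rating reached after ~{years_to_exhaust:.0f} years")
```

At that rate the spin-down policy stays far below the rated budget, which is presumably why the commenter is comfortable with it.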
Hard drives are often configured to spin down when idle for a certain time. This can cause many spinups and spindowns per day. So I don't buy this at all. But I don't have supporting evidence that backs up this notion.
There seems to be a huge difference between spin-down while the NAS is up vs shutting the whole NAS down and restarting. When I start my NAS, it takes a bunch of time to be back up: it seems to do a lot of checking on / syncing of the drives and puts a fair amount of load on them (the same is true for the CPU as well, just look at your CPU load right after startup).
OTOH, when the NAS spins up a single disk again, I haven't noticed any additional load. Presumably, the read operation just waits until the disk is ready.
> Hard drives are often configured to spin down when idle for a certain time. This can cause many spinups and spindowns per day.
I was under the impression that this was, in fact, known to reduce drive longevity! It is done anyway in order to save power, but the reliability tradeoff is known.
No idea where I read that though, I thought it was "common knowledge" so maybe I'm wrong.
It does, and primarily increases the load/unload cycle count. Some WD drives are only rated at a few hundred thousand load/unload cycles. It's best to buy drives that can handle being always on and leave them on.
This is completely dependent on access frequency. Do you have a bunch of different people accessing many files frequently? Are you doing frequent backups?
If so then yes, keeping them spinning may help improve lifespan by reducing frequent disk jerk. This is really only applicable when you're at a pretty consistent high load and you're trying to prevent your disks from spinning up and down every few minutes or something.
For a homelab, you're probably wasting way more money in electricity than you are saving in disk maintenance by leaving your disks spinning.
Does this include some kind of built-in hash/checksum system to record e.g. md5 sums of each file and periodically test them? I have a couple of big drives for family media I'd love to protect with a bit more assurance than "the drive did not fail".
This is not strictly accurate. ZFS records checksums of the records of data that make up the file storage. If you want an end to end file-level checksum (like a SHA-256 digest of the contents of the file) you still need to layer that on top. Which is not to say it's bad, and it's certainly something I rely on a lot, but it's not quite the same!
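If you do want that extra file-level layer on top of what ZFS already checks, here is a minimal sketch of the idea in Python; the manifest location and directory argument are placeholders, and a real tool would want incremental updates and better reporting:

```python
#!/usr/bin/env python3
"""Minimal sketch: record a SHA-256 digest per file and verify it later.

This layers end-to-end file checksums on top of whatever the filesystem
already does. Paths are placeholders; adjust to your own layout.
"""
import hashlib
import json
import sys
from pathlib import Path

MANIFEST = Path("checksums.json")

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build(root: Path) -> None:
    manifest = {str(p): sha256_of(p) for p in sorted(root.rglob("*")) if p.is_file()}
    MANIFEST.write_text(json.dumps(manifest, indent=2))
    print(f"Recorded {len(manifest)} checksums in {MANIFEST}")

def verify() -> None:
    manifest = json.loads(MANIFEST.read_text())
    bad = [p for p, digest in manifest.items()
           if not Path(p).is_file() or sha256_of(Path(p)) != digest]
    print("All files verified OK" if not bad else f"{len(bad)} missing or mismatching files: {bad}")

if __name__ == "__main__":
    # Usage: checksums.py build /path/to/media   |   checksums.py verify
    if sys.argv[1] == "build":
        build(Path(sys.argv[2]))
    else:
        verify()
```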
Discussions on checksumming filesystems usually revolve around ZFS and BTRFS, but does anyone have experience with bcachefs? It's upstreamed in the Linux kernel, I learned, and is supposed to have full checksumming. The author also seems to take filesystem responsibility seriously.
I tried it out on my homelab server right after the merge into the Linux kernel.
Took roughly one week for the whole raid to stop mounting because of the journal (8hdd, 2 ssd write cache, 2 nvme read cache).
The author responded on Reddit within a day. I tried his fix (which meant compiling the Linux kernel and booting from that), but it didn't resolve the issue. He sadly didn't respond after that, so I wiped and switched back to a plain mdadm RAID after a few days of waiting.
I had everything important backed up, obviously (though I did lose some unimportant data), but it did remind me that bleeding edge is indeed ... Unstable
The setup process and features are fantastic however, simply being able to add a disk and flag it as read/write cache feels great.
I'm certain I'll give it another try in a few years, after it had some time in the oven.
New filesystems seems to have a chicken and egg problem really. It's not like switching from Nvidia's proprietary drivers to nouveau and then back if it turns out they don't work that well. Switching filesystems, especially in larger raid setups where you desperately need more testing and real world usage feedback, is pretty involved, and even if you have everything backed up it's pretty time consuming restoring everything should things go haywire.
And even if you have the time and patience to be one of these early adopters, debugging any issues encountered might also be difficult, as ideally you want to give the devs full access to your filesystem for debugging and attempted fixes, which is obviously not always feasible.
So anything beyond the most trivial setups and usage patterns gets a miniscule amount of testing.
In an ideal world, you'd nail your FS design first try, make no mistakes during implementation and call it a day. I'd like to live in an ideal world.
> In an ideal world, you'd nail your FS design first try, make no mistakes during implementation and call it a day
Crypto implementations and FS implementations strike me as the ideal audience for actually investing the mental energy in the healthy ecosystem we have of modeling and correctness verification systems
Now, I readily admit that I could be talking out of my ass, given that I've not tried to use those verification systems in anger, as I am not in the crypto (or FS) authoring space but AWS uses formal verification for their ... fork? ... of BoringSSL et al https://github.com/awslabs/aws-lc-verification#aws-libcrypto...
A major chunk of storage reliability is all these weird and unexpected failure modes and edge cases which are not possible to prepare for, let alone write fixed specs for. Software correctness assumes the underlying system behaves correctly and stays fixed, which is not the case. You can't trust the hardware and the systems are too diverse - this is the worst case for formal verification.
After reading the email chain I have to say my enthusiasm for bcachefs has diminished significantly. I had no idea Kent was that stubborn and seems to have little respect for Linus or his rules.
As usual, the top comments in that submission are very biased. I think HN should sort comments in a random order in every polarizing discussion. Anyone reading this, do yourself a favor and dig through both links, or ignore the parent's comment altogether.
Linus "regretted" it in the sense "it was a bit too early because bcachefs is moving at such a fast speed", and not in the sense "we got a second btrfs that eats your data for lunch".
Please provide context and/or short human-friendly explanation, because I'm pretty sure most readers won't go further than your comment and will remember it as "Linus regrets merging bcachefs", helping spread FUD for years down the line.
You're saying this like the takeaway of "Linus regrets merging bcachefs" is unfair when the literal quote from Linus is "[...] I'm starting to regret merging bcachefs." And earlier he says "Nobody sane uses bcachefs and expects it to be stable[...]".
I don't understand how you can read Linus' response and think "Linus regrets merging bcachefs" is an unfair assessment.
What attachment to bcachefs do you have? The concerns are valid, and at first I didn't read it as Linus not wanting another btrfs. But now, thinking about it, why do we have another competing filesystem being developed at this point at all?
Well. Point taken. You have an important core of truth to your argument about polarization.
But...
Strongly disagree.
I think that is a very unfair reading of what I wrote. I feel that you might have a bias which shows, but that would be the same class of ad hominem as you have just displayed. That is why I choose to react even though it might be wise to let sleeping dogs lie. We should minimize polarization, but not to a degree where we cannot have civilized disagreement. You are then doing exactly what you preach not to do. Is that then FUD with FUD on top? Two wrongs make a right?
I was reacting to the implicit approval in mentioning that it had been upstreamed in the kernel. That was the reason for the first link. Regrets were clearly expressed.
Another HN trope is rehashing the same discussions over and over again. That was the reason for the second link. I would like to avoid yet another discussion on a topic which was put into light less than 14 days ago. Putting that more bluntly would have been impolite and polarizing. Yet here I am.
The sad part is that my point got through to you loud and clear. Sad because rather than simply dismissing as polarizing that would have been a great opener for a discussion. Especially in the context of ZFS and durability.
You wrote:
> Linus "regretted" it in the sense "it was a bit too early because bcachefs is moving at such a fast speed", and not in the sense "we got a second btrfs that eats your data for lunch".
If you allow me a little lighthearted response. The first thing which comes to mind was the "They're the same picture" meme[1] from The Office. Some like to move quickly and break things. That is a reasonable point of view. But context matters. For long term data storage I am much more conservative. So while you might disagree; to me it is the exact same picture.
Hence I very much object to what I feel is an ad hominem attack because your own worldview was not reflected suitably in my response. It is fair critique that you feel it is FUD. I do however find it warranted for a filesystem which is marked experimental. It might be the bee's knees, but in my mind it is not ready for mainstream use. Yet.
That is an important perspective for the OP to have. If the OP just wants to play around, all is good. If the OP does not mind moving quickly and breaking things, fine. But for production use? Not there yet. Not in my world.
Telling people to ignore my comment because you know people cannot be bothered to actually read the links? And then lecturing me that people might take the wrong spin on it? Please!
It is marked experimental, and since it was merged into the kernel there have been a few major issues that have been resolved.
I wouldn't risk production data on it, but for a home lab it could be fine.
But you need to ask yourself, how much time are you willing to spend if something should go wrong? I have also been running ZFS for 15+ years, and I've seen a lot of crap because of bad hardware. But with good enterprise hardware it has been working flawlessly.
I do this on my Synology using btrfs. I'm still not convinced SSD caching gives any benefit for a home user. 5 spindle drives can already read and write faster than line rate on the NIC (1GbE), so what is the point of adding another failure point?
You manage 5 discs in a device because you care about data protection, being safer than a single disc.
Yes, SSDs in theory are faster, but you are only as fast as your slowest link, which is the spindle drive. So that cache is a buffer only for frequently read data. In home environments they're next to useless; in enterprises they're certainly useful.
> Yes, SSDs in theory are faster, but you are only as fast as your slowest link, which is the spindle drive. So that cache is a buffer only for frequently read data. In home environments they're next to useless.
If you check the numbers I gave above, I have 2 TiB SSD and 8 TiB hard disk. My 'frequently read data' is basically all the data I care about accessing. The other 8 TiB is mostly for eg steam games I installed and forgot about or for additional backups of some data from cloud services, like Google Photos. These are mostly write-once-read-never.
And eg if I happen to access a steam game that's currently on the HDD, it will quickly migrate to the SSD.
My 'working set' of data is certainly smaller than 2 TiB.
Five disks are safer than a single disk (if you store things multiple times or with erasure coding), but if you stick all five disks in a single device, the safety gains are rather more limited.
Yea, so again the single spindle drive is slowing it down. The spindle drive doesn't get faster because it has an SSD; if you read something from the spindle, it will read at the same speed the spindle is rated at. After it's loaded into the SSD it's faster, but only then.
> Five disks are safer than a single disk (if you store things multiple times or with erasure coding), but if you stick all five disks in a single device, the safety gains are rather more limited.
There are devices called storage servers or NAS. This idea of not putting more than one spindle in a machine is foreign to me.
I'm optimistic about it, but probably won't switch over my home lab for a while. I've had quirks with my (now legacy) zsys + zfs on root for Ubuntu, but since it's a common config//widely used for years it's pretty easy to find support.
I probably won't use bcachefs until a similar level of adoption/community support exists.
Can't comment on bcachefs (I think it's still early), but I've been running with bcache in production on one "canary" machine for years, and it's been rock-solid.
In my experience the environment where the drives are running makes a huge difference in longevity. There's a ton more variability in residential contexts than in data center (or even office) space. Potential temperature and humidity variability is a notable challenge but what surprised me was the marked effect of even small amounts of dust.
Many years ago I was running an 8x500G array in an old Dell server in my basement. The drives were all factory-new Seagates - 7200RPM and may have been the "enterprise" versions (i.e. not cheap). Over 5 years I ended up averaging a drive failure every 6 months. I ran with 2 parity drives, kept spares around and RMA'd the drives as they broke.
I moved houses and ended up with a room dedicated to lab stuff. With the same setup I ended up going another 5 years without a single failure. It wasn't a surprise that the new environment was better, but it was surprising how much better a cleaner, more stable environment ended up being.
A drive failure every 6 months almost sounds more like dirty power than dust, I’ve always kept my NAS/file servers in dusty residential environments (I have a nice fuzzy gray Synology logo visible right now) and never seen anything like that
Except for the helium-filled ones, they aren't sealed; there is a very fine filter that equalises atmospheric pressure. (This is also why they have a maximum operating altitude --- the head needs a certain amount of atmospheric pressure to float.)
Don't know the details, but dust could have been impeding the effectiveness of his fans or clumping to create other hotspots in the system (including in the PSU).
>Many years ago I was running an 8x500G array in an old Dell server in my basement. The drives were all factory-new Seagates - 7200RPM and may have been the "enterprise" versions (i.e. not cheap). Over 5 years I ended up averaging a drive failure every 6 months. I ran with 2 parity drives, kept spares around and RMA'd the drives as they broke.
Hah! I had a 16x500GB Seagate array and also averaged an RMA every six months. I think there was a firmware issue with that generation.
They're not airtight in the true sense (besides the helium filled ones nowadays), but every drive made in the past... 30? 40 years is airtight in the sense that no dust can ever get into the drive. There's a breather hole somewhere (with a big warning to not cover it!) to equalize pressure, and a filter that doesn't allow essentially any particles in.
Unless you’re moving the altitude of the drive substantially after it’s already clogged, how would this happen? There’s no air exchange on hard drives.
Unless your drives are in a perfectly controlled temperature, humidity, and atmospheric pressure environment, those will all impact the internal pressure. Temperature being the primary concern because drives do get rather warm internally while operating.
Sure, it has some impact, but we’re not talking about anything too crazy. And that also assumes full total clogging of all pores… which is unlikely to happen. You won’t have perfect sealing and pressure will just equalize.
> Losing the system due to power shenanigans is a risk I accept.
There is another (very rare) failure an ups protects against, and that's imbalance in the electricity.
You can get a spike (up or down, both can be destructive) if there is construction in your area and something happens with the electricity, or lightning hits a pylon close enough to your house.
The first job I worked at had multiple servers die like that, roughly 10 years ago. It's the only time I've ever heard of such an issue, however.
To my understanding, a UPS protects from such spikes as well, as it will die before letting your servers get damaged.
I’ve had firsthand experience of a lightning strike hitting some gear that I maintained…
My parents’ house got hit right on the TV antenna, which was connected via coax down to the booster/splitter unit in the comms cupboard … then somehow it got onto the nearby network patch panel and fried every wired ethernet controller attached to the network, including those built into switch ports, APs, etc. In the network switch, the current destroyed the device’s power supply too, as it was trying to get to ground I guess.
Still a bit of a mystery how it got from the coax to the cat5. Maybe a close parallel run the electricians put in somewhere?
Total network refit required, but thankfully there were no wired computers on site… I can imagine storage devices wouldn’t have fared very well.
This depends very much on the type of UPS. Big, high dollar UPSes will convert the AC to DC and back to AC, which gives amazing pure sine wave power.
The $99 850VA APC you get from Office Depot does not do this. It switches from AC to battery very quickly, but it doesn't really do power conditioning.
If you can afford the good ones, they genuinely improve reliability of your hardware over the long term. Clean power is great.
Lightning took out a modem and some nearby hardware here about a week ago. Residential. The distribution of dead vs damaged vs nominally unharmed hardware points very directly at the copper wire carrying vdsl. Modem was connected via ethernet to everything else.
I think the proper fix for that is probably to convert to optical, run along a fibre for a bit, then convert back. It seems likely that electricity will take a different route in preference to the glass. That turns out to be disproportionately annoying to spec (not a networking guy, gave up after an hour trying to distinguish products) so I've put a wifi bridge between the vdsl modem and everything else. Hopefully that's the failure mode contained for the next storm.
Mainly posting because I have a ZFS array that was wired to the same modem as everything else. It seems to have survived the experience but that seems like luck.
We’ve had such spikes in an old apartment we were living in. I had no servers back then, but LED lamps annoyingly failed every few weeks. It was an old building from the 60s and our own apartment had some iffy quick fixes in the installation.
Nothing is really going to protect you from a direct lightning strike. Lightning strikes are on the order of millions of volts and thousands of amps. It will arc between circuits that are close enough and it will raise the ground voltage by thousands of volts too. You basically need a lighting rod buried deep into the earth to prevent it hitting your house directly and then you’re still probably going to deal with fried electronics (but your house will survive). Surge protectors are for faulty power supplies and much milder transient events on the grid and maybe a lightning strike a mile or so away.
So I'm still left with int0x29's original question: "Isn't this [an electricity spike that a UPS could protect against] what a surge protector is for?"
Yes. In most cases, assuming you live in a 220V country, a surge protector will absorb the upwards spike, and the voltage range (a universal PSU can go as low as 107V) will handle the brownout voltage dip.
The 'secret' is not that you turn them off. It's simply luck.
I have 4TB HGST drives running 24/7 for over a decade. OK, not 24 hours a day but 8, and also 0 failures.
But I'm also lucky, like you.
Some of the people I know have several RMAs with the same drives so there's that.
My main question is: what is it that takes 71TB but can be turned off most of the time? Is this the server where you store backups?
It can be luck, but with 24 drives, it feels very lucky. Somebody with proper statistics knowledge can probably calculate, with a guesstimated 1% yearly failure rate, how likely it would be to have all 24 drives remaining.
And remember, my previous NAS with 20 drives also didn't have any failures. So N=44, how lucky must I be?
It's for residential usage, and if I need some data, I often just copy it over 10Gbit to a system that uses much less power and this NAS is then turned off again.
We don't really have to guess. Backblaze posted their stats for 4 TB HGST drives for 2024, and of their 10,000 drives, 5 failed. If OP's 2014 4 TB HGST drives are anything like this, then this is just snake oil and magic rituals and it doesn't really matter what you do.
> If OP's 2014 4 TB HGST drives are anything like this, then this is just snake oil and magic rituals and it doesn't really matter what you do.
It might matter what you do, but we only have public data for people in datacenters. Not a whole lot of people with 10,000 drives are going to have them mostly turned off, and none of them shared their data.
Drives have a bathtub curve, but if you want you can be conservative and estimate first-year failure rates throughout. So that's p = 5/10000 for drive failure. The chance of no failure per year (because of our assumption) is 1-p, so the chance of no failure over ten years is (1-p)^10, or about 99.5% per drive.
Those are different drives though, they're MegaScale DC 4000 while OP is using 4 TB Deskstars. Not sure if they're basically the same (probably). I've also had a bunch of these 4TB Megascale drives and absolutely no problems whatsoever in about 10 years as well. Run very cool as well (I think they're 5400 rpm not 7200 rpm).
The main issue with drives like these is that 4 TB is just so little storage compared to 16-20 TB class drives, it kinda gets hard to justify the backplane slot.
It's (1-p)^(24×10), where p is the drive failure rate per year (assuming it doesn't go up over time). So at 1% that's about 9%, or roughly a 1-in-10 chance of this result. Not exactly great, but not impossible.
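Making the model in the two comments above explicit, here's a small Python sketch; it assumes a constant, independent annual failure rate per drive, which ignores bathtub effects and batch correlation:

```python
# Probability that none of `drives` fail over `years`, assuming a constant,
# independent annual failure rate per drive.
def p_no_failures(annual_failure_rate: float, drives: int, years: int) -> float:
    return (1 - annual_failure_rate) ** (drives * years)

print(p_no_failures(0.0005, 24, 10))  # Backblaze-like 0.05%/year -> ~0.89
print(p_no_failures(0.01,   24, 10))  # 1%/year guess             -> ~0.09
```

So whether 10 years with zero failures looks like luck or not depends almost entirely on which failure rate you plug in.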
The failure rate is not truly random with a nice normal distribution of failures over time. There are sometimes higher rates in specific batches or they can start failing altogether etc.
Backblaze reports always are interesting insights into how consumers drives behave under constant load.
I have a 22TB RAID10 system out in my detached garage that works as an "off-site" backup server for all my other systems. It stays off most of the time. It's on when I'm backing up data to it, or if it's running backups to LTO tape. Or it's on when I'm out in the garage doing whatever project, I use it to play music and look up stuff on the web. Otherwise it's off, most of the time.
There have been drives where power cycling was hazardous. So, whilst I agree with the model, it shouldn't be assumed this is always good, all the time, for all people. Some SSDs need to be powered periodically. The duty cycle for a NAS probably meets that burden.
Probably good, definitely cheaper power costs. Those extra grease on the axle drives were a blip in time.
I wonder if Backblaze does a drive on-off lifetime stats model? I think they are in the always-on problem space.
> There have been drives where power cycling was hazardous.
I know about this story from 30+ years ago. It may have been true then. It may be even true now.
Yet, in my case, I don't power cycle these drives often. At most a few times a month. I can't say or prove it's a huge risk. I only believe it's not. I have accepted this risk for over 15+ years.
Update: remember that hard drives have an option to spin down when idle. So hard drives can handle many spinups a day.
In the early '90s, some Quantum 105S hard drives had a "stiction" problem, and were shipped with Sun SPARCstations.
IME at the time, power off a bunch of workstations, such as for building electrical work, and probably at least one of them wouldn't spin back up the next business day.
Pulling the drive sled, and administering percussive maintenance against the desktop, could work.
I debated posting because it felt like shitstirring. I think overwhelmingly what you're doing is right. And if a remote power-on, e.g. WOL, works on the device, so much the better. If I could wish for one thing, it's mods to code or documentation of how to handle drive power-down on ZFS. The rumour mill says ZFS doesn't like spindown.
What is there to handle? I have a ZFS array that works just fine with hard drives that automatically spin down. ZFS handles this without an issue.
The main gotchas tend to be: if you use the array for many things, especially stuff that throws off log files, you will constantly be accessing that array and resetting the spin down timers. Or you might be just at the threshold for spindown and you'll put a ton of cycles on it as it bounces from spindown to access to spin up.
For a static file server (rarely accessed backups or media), partitioned correctly, it works great.
I did try using HDD spindown on ZFS, but I remember (it's a long time ago) that I encountered too many vague errors that scared me, and I just disabled spindown altogether.
Long ago I had a client who could have been an episode of "IT Nightmares".
They used internal 3.5" hard drives along with USB docks to backup a couple Synology devices...It seemed like 1/10 times when you put a drive back in the dock to restore a file or make another backup, the drive wouldn't power back up.
We've been using a multi-TB PostgreSQL database on ZFS for quite a few years in production and have encountered zero problems so far, including no bit flips. In case anyone is interested, our experience is documented here:
Regarding the intermittent power cutoffs during boot, it should be noted that the drives pull power from the 5V rail on startup: comparable drives typically draw up to 1.2A each. Combined with the maximum load of 25A on the 5V rail (Seasonic Platinum 860W), it's likely you'll experience power failures during boot if staggered spinup is not used.
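A quick sanity check of that 5V budget in Python; the 1.2A-per-drive and 25A figures are the comment's assumptions, not measured values:

```python
# 5V-rail budget check for simultaneous spinup of all drives.
drives = 24
spinup_current_a = 1.2   # assumed worst-case 5V draw per drive at spinup
rail_limit_a = 25.0      # 5V rail limit quoted for the Seasonic Platinum 860W

peak = drives * spinup_current_a
print(f"Peak 5V draw ~{peak:.1f} A vs {rail_limit_a:.0f} A limit; "
      f"staggered spinup {'needed' if peak > rail_limit_a else 'optional'}")
```

With all 24 drives spinning up at once the nominal draw (~28.8 A) exceeds the rail limit, which matches the observed cutoffs.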
I have a similar-sized array which I also only power on nightly to receive backups, or occasionally when I need access to it for a week or two at a time.
It's a whitebox RAID6 running NTFS (tried ReFS, didn't like it), and has been around for 12+ years, although I've upgraded the drives a couple times (2TB --> 4TB --> 16TB) - the older Areca RAID controllers make it super simple to do this. Tools like Hard Disk Sentinel are awesome as well, to help catch drives before they fail.
I have an additional, smaller array that runs 24x7, which has been through similar upgrade cycles, plus a handful of clients with whitebox storage arrays that have lasted over a decade. Usually the client ones are more abused (poor temperature control when they delay fixing their serveroom A/C for months but keep cramming in new heat-generating equipment, UPS batteries not replaced diligently after staff turnover, etc...).
Do I notice a difference in drive lifespan between the ones that are mostly-off vs. the ones that are always-on? Hard to say. It's too small a sample size and possibly too much variance in 'abuse' between them. But definitely seen a failure rate differential between the ones that have been maintained and kept cool, vs. allowed to get hotter than is healthy.
I can attest those 4TB HGST drives mentioned in the article were tanks. Anecdotally, they're the most reliable ones I've ever owned. And I have a more reasonable sample size there as I was buying dozens at a time for various clients back in the day.
Having 24 drives probably offers some performance advantages, but if you don't require them, having a 6-bay NAS with 18TB disks instead would offer a ton of advantages in terms of power usage, noise, space required, cost and reliability.
I currently support many NAS servers in the 50TB - 2PB range, many of them being 10, 12, and up to 15 years old for some of them. Most of them still run with their original power supplies, motherboards and most of their original (HGST -- now WD -- UltraStar) drives, though of course a few drives have failed for some of them (but not all).
2, 4, 8TB HGST UltraStar disks are particularly reliable. All of my desktop PCs currently host mirrors of 2009-vintage 2 TB drives that I got when they were put out of service. I have heaps of spare, good 2 TB drives (and a few hundred still running in production after all these years).
For some reason 14TB drives seem to have a much higher failure rate than Helium drives of all sizes. On a fleet of only about 40 14 TB drives, I had more failures than on a fleet of over 1000 12 and 16 TB.
I've had the exact same NAS for over 15 years. It's had 5 hard drives replaced, 2 new enclosures and 1 new power supply, but it's still as good as new...
I'm curious what's your use case for 71TB of data where you can also shut it down most of the time?
My NAS is basically constantly in use, between video footage being dumped and then pulled for editing, uploading and editing photos, keeping my devices in sync, media streaming in the evening, and backups from my other devices at night..
I have a mini PC + 4x external HDDs (I always bought used) on Windows 10 with ReFS since probably 2016 (recently upgraded to Win 11), maybe earlier. I don't bother powering off.
The only time I had problems is when I tried to add a 5th disk using a USB hub, which caused drives attached to the hub to get disconnected randomly under load. This actually happened with 3 different hubs, so I have since stopped trying to expand that monstrosity and just replace drives with larger ones instead. Don't use hubs for storage; the majority of them are shitty.
Currently ~64TiB (less with redundancy).
Same as OP. No data loss, no broken drives.
A couple of years ago I also added an off-site 46TiB system with similar software, but a regular ATX with 3 or 4 internal drives because the spiderweb of mini PC + dangling USBs + power supplies for HDDs is too annoying.
A typical 7200 rpm disk consumes about 5W when idle. For 24 drives, it's 120W. Rather substantial, but not an electric kettle level. At $0.25 / kWh, it's $0.72 / day, or about $22 / mo, or slightly more than $260 / year. But this is only the disks; the CPU + mobo can easily consume half as much on average, so it would be more like $30-35 / mo.
And if you have electricity at a lower price, the numbers change accordingly.
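The arithmetic above as a small Python sketch; the wattages and tariff are the comment's estimates, so swap in your own numbers:

```python
# Yearly electricity cost of an always-on load.
def yearly_cost(watts: float, price_per_kwh: float) -> float:
    return watts / 1000 * 24 * 365 * price_per_kwh

print(yearly_cost(120, 0.25))  # 24 idle disks at ~5 W each  -> ~$263/year
print(yearly_cost(180, 0.25))  # disks plus CPU/motherboard  -> ~$394/year
```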
This is why my ancient NAS uses 5400 RPM disks, and a future upgrade could use even slower disks if these were available. The reading bandwidth is multiplied by the number of disks involved.
The original blog states that the machine used 200W idle. That's 4.8 kWh a day, or 17,520 kWh over 10 years. At around 0.30 euro per kWh that's 5K+ euro if I'm not mistaken.
I'm confused about optimizing 7 watts as important -- rough numbers, 7 watts is 61 kWh/y. If you assume US-average prices of $0.16/kWh that's about $10/year.
edit: looks like for the netherlands (where he lives) this is more significant -- $0.50/kWh is the average price, so ~$32/year
Let me tell you, powering these drives on and off is far more dangerous than just keeping them running. 10 years is well within the MTBF of these enterprise drives. (I worked for 10 years as an enterprise storage technician; I saw a lot of sh*.)
My takeaway is that there is a difference between residential and industrial usage, just as there is a difference between residential car ownership and 24/7 taxi / industrial use
And that no matter how amazing the industrial revolution has been, we can build reliability at the residential level but not the industrial level.
And certainly at the price points.
The whole “At FAANG scale” is a misnomer - we aren’t supposed to use residential quality (possibly the only quality) at that scale - maybe we are supposed to park our cars in our garages and drive them on a Sunday
Maybe we should keep our servers at home, just like we keep our insurance documents and our notebooks
Regular reminder: RAID (and ZFS) don't replace backups. It's an availability solution to reduce downtime in event of disk failure. Many things can go wrong with your files and filesystem besides disk failure, eg user error, userspace software/script bugs, driver or FS or hardware bugs, ransomware, etc)
The article mentions backups near the end saying eg "most of the data is not important" and the "most important" data is backed up. Feeling lucky I guess.
ZFS can help you with backups and data integrity beyond what RAID provides, though. For example, I back up to another machine using zfs's snapshot sending feature. Fast and convenient. I scrub my machine and the backup machine every week, so if any data has become damaged beyond repair on my machine, I know pretty quickly. Same with the backup machine. And because of the regular integrity checking on my machine, it's very unlikely that I accidentally back up damaged data. And finally, frequent snapshots are a great way to recover from software and some user errors.
Of course, there are still dangers, but ZFS without backup is a big improvement over RAID, and ZFS with backups is a big improvement over most backup strategies.
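For anyone curious what that snapshot-based backup flow looks like in practice, here's a minimal sketch; the dataset names and backup host are hypothetical, and a real script needs error handling, snapshot rotation and tracking of the last sent snapshot:

```python
#!/usr/bin/env python3
"""Sketch of an incremental ZFS snapshot backup over SSH (hypothetical names)."""
import datetime
import subprocess

DATASET = "tank/data"                # hypothetical source dataset
REMOTE = "backup-host"               # hypothetical backup machine reachable via SSH
REMOTE_DATASET = "backup/tank-data"  # hypothetical target dataset

def run(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

def backup(previous_snapshot: str | None) -> str:
    snapshot = f"{DATASET}@{datetime.date.today():%Y-%m-%d}"
    run(f"zfs snapshot {snapshot}")
    if previous_snapshot:
        # Incremental: send only blocks changed since the previous snapshot.
        run(f"zfs send -i {previous_snapshot} {snapshot} | ssh {REMOTE} zfs recv {REMOTE_DATASET}")
    else:
        run(f"zfs send {snapshot} | ssh {REMOTE} zfs recv {REMOTE_DATASET}")
    return snapshot  # remember this for the next incremental run

if __name__ == "__main__":
    backup(previous_snapshot=None)   # first run: full send
```

Pair that with a periodic `zpool scrub` on both ends, as the comment describes, and damaged data gets caught before it can silently propagate into the backups.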
Around 12 years ago I helped design and set up a 48-drive, 9U, ~120TB NAS in the Chenbro RM91250 chassis (still going strong! but plenty of drive failures along the way...). This looks like it's probably the 24-drive/4U entry in the same line (or similar). IIRC the fans were very noisy in their original hot-swappable mounts but replacing them with fixed (screw) mounts made a big difference. I can't tell from the picture if this has hot-swappable fans, though - I think I remember ours having purple plastic hardware.
> It's possible to create the same amount of redundant storage space with only 6-8 hard drives with RAIDZ2 (RAID 6) redundancy.
I've given up on striped RAID. Residential use requires easy expandability to keep costs down. Expanding an existing parity stripe RAID setup involves failing every drive and slowly replacing them one by one with bigger capacity drives while the whole array is in a degraded state and incurring heavy I/O load. It's easier and safer to build a new one and move the data over. So you pretty much need to buy the entire thing up front which is expensive.
Btrfs has a flexible allocator which makes expansion easier but btrfs just isn't trustworthy. I spent years waiting for RAID-Z expansion only for it to end up being a suboptimal solution that leaves the array in some kind of split parity state, old data in one format and new data in another format.
It's just so tiresome. Just give up on the "storage efficiency" nonsense. Make a pool of double or triple mirrors instead and call it a day. It's simpler to set up, easier to understand, more performant, allows heterogeneous pools of drives which lowers risk of systemic failure due to bad batches, gradual expansion is not only possible but actually easy and doesn't take literal weeks to do, avoids loading the entire pool during resilvering in case of failures, and it offers so much redundancy the only way you'll lose data is if your house literally burns down.
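For illustration, a sketch of that pool-of-mirrors layout and how it grows, using standard zpool commands driven from Python; the device names are placeholders:

```python
# Sketch: a striped pool of 2-way mirrors, expanded one mirror pair at a time.
import subprocess

def zpool(*args: str) -> None:
    subprocess.run(["zpool", *args], check=True)

# Initial pool: two mirror vdevs striped together.
zpool("create", "tank",
      "mirror", "/dev/disk/by-id/drive-a", "/dev/disk/by-id/drive-b",
      "mirror", "/dev/disk/by-id/drive-c", "/dev/disk/by-id/drive-d")

# Later expansion: just add another mirror pair; existing vdevs are untouched,
# and a resilver only reads the failed disk's mirror partner, not the whole pool.
zpool("add", "tank",
      "mirror", "/dev/disk/by-id/drive-e", "/dev/disk/by-id/drive-f")
```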
I dislike that article/advice because it’s dishonest / downplaying a limitation of ZFS and advocating that people should spend a lot more money, that may likely not be necessary at all.
If you prefer to own media instead of streaming it, are into photography, video editing, 3D modelling, any AI-related stuff (models add up) or are a digital hoarder/archivist you blow through storage rather quickly. I'm sure there are some other hobbies that routinely work with large file sizes.
Storage is cheap enough that rather than deleting 1000s of photos and never being able to reclaim or look at them again, I'd rather buy another drive. I'd rather have a RAW of an 8-year-old photo that I overlooked and decide I really like and want to edit/work with than an 87kb resized and compressed JPG of the same file. Same for a mostly-edited 240GB video file. What if I want or need to make some changes to it in the future? May as well hold onto it rather than have to re-edit the video or re-shoot it if the original footage was also deleted.
Content creators have deleted their content often enough that if I enjoyed a video and think future me might enjoy rewatching the video - I download it rather than trust that I can still watch it in the future. Sites have been taken offline frequently enough that I download things. News sites keep restructuring and breaking all their old article links so I download the articles locally. JP artists are notorious for deleting their entire accounts and restarting under a new alias that I routinely archive entire Pixiv/Twitter accounts if I like their art as there is no guarantee it will still be there to enjoy the next day.
It all adds up and I'm approaching 2 million well-organized and (mostly) tagged media files in my Hydrus client [0]. I have many scripts to automate downloading and tagging content for these purposes. I very, very rarely delete things. My most frequent reason for deleting anything is "found in higher quality" which conceptually isn't really deleting.
Until storage costs become unreasonable I don't see my habits changing anytime soon. On the contrary - storage keeps getting cheaper and cheaper and new formats keep getting created to encode data more and more efficiently.
This is a drop in the bucket for photographers, videographers, and general backups of RAW / high resolution videos from mobile devices. 80TB [usable] was "just enough" for my household in 2016.
It's still not enough to hold a local copy of sci-hub, but could probably hold quite a few recorded conference talks (or similar multimedia files) or a good selection of huggingface models.
I wish he had talked more about his movie collection. I’m interested in the methods of selecting initial items as well as ones that survive in the collection for 10+ years.
I don’t know about other people but I run Plex because it lets me run my home movie collection through multiple clients (Apple TV, browser, phones, etc). iTunes works great with content bought from Apple, but is useless when playing your own media files from other sources.
I just want every Disney movie available so my kids can watch without bothering me.
It's always nice to know that people can store their data for so long. In my research lab, we still only use separate external HDDs due to budget reasons. Last year 4 (out of 8) drives failed and we lost the data. I guess we mainly work with public data so it is not a big deal. But it is a dream of mine to research without such worries. I do keep backups for my own stuff, though, but I'm the only one in my lab who does.
Regarding the custom PID controller script: I could have sworn the Linux kernel had a generic PID controller available as a module, which you could set up via the device tree, but I can't seem to find it! (grepping for 'PID' doesn't provide very helpful results lol).
I think it was used on nVidia Tegra systems, maybe? I'd be interested to find it again, if anyone knows. :)
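For reference, the userspace version is not much code. Here's a minimal sketch of a PID fan-control loop of the sort the article's script implements; the hwmon temperature and PWM sysfs paths are placeholders for whatever your hardware exposes, and the gains are illustrative, not tuned:

```python
#!/usr/bin/env python3
"""Minimal userspace PID fan control sketch (placeholder paths, untuned gains)."""
import time

SETPOINT = 40.0                                      # target hottest-drive temperature, deg C
TEMP_PATH = "/sys/class/hwmon/hwmon1/temp1_input"    # hypothetical drive temperature sensor
PWM_PATH = "/sys/class/hwmon/hwmon2/pwm1"            # hypothetical fan PWM control file
KP, KI, KD = 10.0, 0.5, 1.0                          # illustrative gains

def drive_temp() -> float:
    # hwmon reports millidegrees; a real script would take the max over all drives.
    with open(TEMP_PATH) as f:
        return int(f.read().strip()) / 1000.0

def set_pwm(value: float) -> None:
    with open(PWM_PATH, "w") as f:
        f.write(str(int(max(0, min(255, value)))))   # clamp to the 0-255 PWM range

def control_loop(interval_s: float = 10.0) -> None:
    integral, previous_error = 0.0, 0.0
    while True:
        error = drive_temp() - SETPOINT
        integral += error * interval_s
        derivative = (error - previous_error) / interval_s
        set_pwm(KP * error + KI * integral + KD * derivative)
        previous_error = error
        time.sleep(interval_s)

if __name__ == "__main__":
    control_loop()
```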
I feel like 10 years is when my drives started failing the most.
I run an 8x8TB array with RAIDZ2 redundancy. Initially it was an 8x2TB array, but drives started failing once every 4 months; after 3 drives failed I upgraded the remaining ones.
Only downside to hosting your own is power consumption. OS upgrades have been surprisingly easy.
Because a lot of people pointed out that it's not that unlikely that all 24 drives survive without failure over 10 years, given the low failure rate reported by Backblaze Drive Stats Reports, I've also updated the article with that notion.
I know it's not a guarantee of no-corruption, and ZFS without ECC is probably no more dangerous than any other file system without ECC, but if data corruption is a major concern for you, and you're building out a pretty hefty system like this, I can't imagine not using ECC.
Slow on-disk data corruption resulting from gradual and near-silent RAM failures may be like doing regular 3-2-1 backups -- you either mitigate against the problem because you've been stung previously, or you're in that blissful pre-sting phase of your life.
EDIT: I found TFA's link to the original build out - and happily they are in fact running a Xeon with ECC. Surprisingly it's a 16GB box (I thought ZFS was much hungrier on the RAM : disk ratio.) Obviously it hasn't helped for physical disk failures, but the success of the storage array owes a lot to this component.
The system is using ECC and I specifically - unrelated to ZFS - wanted to use ECC memory to reduce risk of data/fs corruption. I've also added 'ecc' to the original blog post to clarify.
Edit: ZFS for home usage doesn't need a ton of RAM as far as I've learned. There is the 1 GB of RAM per 1TB of storage rule of thumb, but that was for a specific context. Maybe the ill-fated data deduplication feature, or was it just to sustain performance?
Thanks, and all good - it was my fault for not following the link in this story to your post about the actual build, before starting on my mini-rant.
I'd heard the original ZFS memory estimations were somewhat exuberant, and recommendations had come down a lot since the early days, but I'd imagine given your usage pattern - powered on periodically - a performance hit for whatever operations you're doing during that time wouldn't be problematic.
I used to use mdadm for software RAID, but for several years now my home boxes are all hardware RAID. LVM2 provides the other features I need, so I haven't really ever explored zfs as a replacement for both - though everyone I know that uses it, loves it.
It's difficult as a home user to find ECC memory, harder to make sure it actually works in your hardware configuration, and near-impossible to find ECC memory that doesn't require lower speeds than what you can get for $50 on amazon.
I would very much like to put ECC memory in my home server, but I couldn't figure it out this generation. After four hours I decided I had better things to do with my time.
Indeed. I'd started to add an aside to the effect of 'ten years ago it was probably easier to go ECC'. I'll add it here instead.
A decade ago if you wanted ECC your choice was basically Xeon, and all Xeon motherboards would accept ECC.
I agree that these days it's much more complex, since you are ineluctably going to get sucked into the despair-spiral of trying to work out what combination of Ryzen + motherboard + ECC RAM will give you actual, demonstrable ECC (with correction, not just detection).
Sounds like the answer is to just buy another Xeon then, even if it's a little older and maybe secondhand. I think there's a reason the vast majority of Supermicro motherboards are still just Intel only.
Accidentally unplugged my raid 5 array and thought I damaged the raid card. Hours after boot I’d get problems. I glitched a RAM chip and the array was picking it up as disk corruption.
For what it's worth this isn't that uncommon. Most drives fail in the first few years, if you get through that then annualized failure rates are about 1-2%.
I've had the (small) SSD in a NAS fail before any of the drives due to TBW.
My home NAS is about 200TB, runs 24/7, is very loud and power inefficient, does a full scrub every Sunday, and also hasn’t had any drive failures. It’s only been 4 or 5 years, however.
Yeah, this definitely caused me to raise an eyebrow. UPS covers brown outs and obviously the occasional temporary power outage. All those drives spinning at full speed suddenly coming to a grinding halt as the power is suddenly cut, and you're quibbling over a paltry additional 10 watts? I can only assume that the data is not that important.
As part of resilience testing I've been turning off our 24 drive backup drive array daily for two years, by flicking the wall switch. So far nothing happened.
I’m guessing that the 71 TiB is mostly used for media, as in plex/jellyfin media, which is sad to lose but not unrecoverable. How would one ever store that much personal data? I hope they have an off-site backup for the all-important, unrecoverable data like family photos and whatnot.
I have about 80TB (my wife's data and mine) backed up to LTO5 tape. It's pretty cheap to get refurb tape drives on eBay. I pay about $5.00/TB for tape storage, not including the ~$200 for the LTO drive and an HBA card, so it was pretty economical.
I get email alerts from eBay for anything related to LTO-5, and I only buy the cheap used tapes. They are still fine; most of them are very low use. The tapes actually have a chip in them that stores usage data like a car's odometer, so you can know how much a tape has been used. So far I trust tapes more than I'd trust a refurb hard drive for backups. And I also really like that the tapes have a write-protect notch on them, so once I write my backup data, there's no risk of having a tape accidentally get erased, unlike if I plugged in a hard drive and maybe there's some ransomware virus that automatically fucks with any hard drive that gets plugged in. It's just one less thing to worry about.
I don't have a 10Gb switch. I connect this server directly to two other machines as they all have 2 x 10Gbit. This NAS can saturate 10Gbit but the other side can't, so I'm stuck at 500-700 MB/s, I haven't measured it in a while.
I run a similar but less sophisticated setup. About 18 TiB now, and I run it 16 hours a day. I let it sleep 8 hours per night so that it's well rested in the morning. I just do this on a cron because I'm not clever enough to SSH into a turned off (and unplugged!) machine.
4 drives: 42k hours (4.7 years), 27k hours (3 years), 15k hours (1.6 years), and the last drive I don't know because apparently it isn't SMART.
0 errors according to scrub process.
... but I guess I can't claim 0 HDD failures. There has been 1 or 2, but not for years now. Knock on wood. No data loss because of mirroring. I just can't lose 2 in a pair. (Never run RAID5 BTW, lost my whole rack doing that)
I use a wifi-controlled power point to power up and down a pair of RAID1 backup drives.
A weekly cronjob on another (always on) machine does some simple tests (md5 checksums of "canary files" on a few machines on the network) then powers up and mounts the drives, runs an incremental backup, waits for it to finish, then unmounts and powers them back down. (There's also a double-check cronjob that runs 3 hours later that confirms they are powered down, and alerts me if they aren't. My incrementals rarely take more than an hour.)
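A sketch of that canary-check-then-backup flow in Python; the canary paths, expected hashes, power-control helper and rsync destination are all placeholders, while the md5 check mirrors what the comment describes:

```python
#!/usr/bin/env python3
"""Sketch: refuse to power up the backup drives if known-good canary files changed."""
import hashlib
import subprocess
import sys

# Hypothetical canary files and their known-good MD5 digests.
CANARIES = {
    "/srv/data/canary1.jpg": "expected-md5-digest-1",
    "/srv/data/canary2.pdf": "expected-md5-digest-2",
}

def md5_of(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def canaries_ok() -> bool:
    return all(md5_of(path) == digest for path, digest in CANARIES.items())

if __name__ == "__main__":
    if not canaries_ok():
        sys.exit("Canary mismatch: refusing to power up the backup drives")
    # Placeholder helper that toggles the smart power point / drive power.
    subprocess.run(["/usr/local/bin/backup-drives", "on"], check=True)
    try:
        # Incremental, no-delete copy (note: no --delete flag).
        subprocess.run(["rsync", "-a", "/srv/data/", "/mnt/backup/data/"], check=True)
    finally:
        subprocess.run(["/usr/local/bin/backup-drives", "off"], check=True)
```

If the canaries have been encrypted, their hashes change and the drives simply never get powered up, which is the whole point of the check.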
Just among my "personal" stuff, over the last 12 years I've completely lost 4 hard drives to age/physical failure. ZFS made it a non-issue twice, and aided with recovery once (once I'd dd'd what was left of one drive, zvol snapshots made "risky" experimentation cheap & easy).
I have a similar approach but I don’t use ZFS. It’s a bit superfluous, especially if you’re using your storage periodically (turning it on and off). I use redundant NVMe drives in two stages and periodically save important data onto multiple HDDs (cold storage). Worth noting: it’s important to prune your data.
I also do not backup photos and videos locally. It’s a major headache and they just take up a crap ton of space when Amazon Prime will give you photo storage for free.
Anecdotally, the only drives that have failed on me were enterprise-grade HDDs, and they all failed within a year, in an always-on system. I also think RAID is over-utilized and frankly a big money pit outside of enterprise-level environments.
>but for residential usage, it's totally reasonable to accept the risk.
Polite disagree. Data integrity is the natural expectation humans have from computers, and thus we should stick to filesystems with data checksums such as ZFS, as well as ECC memory.
> we should stick to filesystems with data checksums such as ZFS, as well as ECC memory.
While I don't disagree with this statement, consider the reality:
- APFS has metadata checksums, but no data checksums. WTF Apple?
- Very few Linux distributions ship zfs.ko (&spl.ko); those that do, theoretically face a legal risk (any kernel contributor could sue them for breaching the GPL); rebuilding the driver from source is awkward (even with e.g. DKMS), pulls more power, takes time, and may randomly leave your system unbootable (YES it happened to me once).
- Linux itself explicitly treats ZFS as unsupported; loading the module taints the kernel.
- FreeBSD is great, and is actually making great progress catching up with Linux on the desktop. Still, it is a catch-up game. I also don't want to install a system that needs to install another guest system to actually run the programs I need.
- There are no practical alternatives to ZFS that even come close; sibling comment complains about btrfs data loss. I never had the guts to try btrfs in production after all the horror stories I've heard over the decade+.
- ECC memory on laptops is practically unheard of, save for a couple of niche ThinkPad models, and it comes with a large premium on desktops.
What are the practical choices for people who do not want to cosplay as sysadmins?
As a rule of thumb (IMHO) steer clear of commodity NAS Cloud add ons, such things attract ransomware hackers like flies to a tip whether it's QNAP, Synology, or InsertVendorHere.
I feel that on HN people tend to be a bit pedantic about topics like data integrity, and in business settings I actually agree with them.
But for residential use, risks are just different and, as you point out, you have no option except to use a desktop workstation with ECC. People like/need laptops, so that’s not realistic for most people. "Just run Linux/FreeBSD with ZFS" isn’t reasonable advice to me.
What I feel most strongly about is that it’s all about circumstances, context, and risk evaluation. And I see so many blanket absolutist statements that don’t think about the reality of life and people’s circumstances.
“Tainting” the kernel doesn’t affect operations, though. You’re not allowed to redistribute it with changes — but you, as an entity, can freely use ZFS and the kernel together without restriction.
Linux plus zfs works fine.
> - Very few Linux distributions ship zfs.ko (&spl.ko); those that do, theoretically face a legal risk (any kernel contributor could sue them for breaching the GPL)
> - Linux itself explicitly treats ZFS as unsupported; loading the module taints the kernel.
So modify the ZFS source so it appears as an external GPL module... just don't tell anyone or distribute it...
I can't say much about dracut or having to build the module from source... as a Gentoo user, I do it about once a month without any issues...
> What are the practical choices for people who do not want to cosplay as sysadmins?
The specific complaint is not at all about the kernel identifying itself as tainted, the specific complaint is about the kernel developers' unyielding unwillingness to support any scenario where ZFS is concerned, thus leaving one with even more "sysadmin duties". I want to use my computer, not serve it.
> I never had the guts to try btrfs in production after all the horror stories I've heard over the decade+.
I've been running btrfs as the primary filesystem for all of my desktop machines since shortly after the on-disk format stabilized and the extX->btrfs in-place converter appeared [0], and for my home servers for the past ~five years. In the first few years after I started using it on my desktop machines, I had four or five "btrfs shit the bed and trashed some of my data" incidents. I've had zero issues in the past ~ten years.
At $DAYJOB we use btrfs as the filesystem for our CI workers and have been doing so for years. Its snapshot functionality makes creating the containers for CI jobs instantaneous, and we've had zero problems with it.
I can think of a few things that might separate me from the folks who report issues that they've had within the past five-or-ten years:
* I don't use ANY of the built-in btrfs RAID stuff.
* I deploy btrfs ON TOP of LVM2 LVs, rather than using its built-in volume management stuff. [1]
* I WAS going to say "I use ECC RAM", but one of my desktop machines does not and can never have ECC RAM, so this isn't likely a factor.
The btrfs features I use at home are snapshotting (for coherent point-in-time backups), transparent compression, the built-in CoW features, and the built-in checksumming features.
At work, we use all of those except for compression, and don't use snapshots for backup but for container volume cloning.
[0] If memory serves, this was around the time when the OCZ Vertex LE was hot, hot shit.
[1] This has actually turned out to be a really cool decision, as it has permitted me to do low- or no-downtime disk replacement or repartitioning by moving live data off local PVs and onto PVs attached via USB or NBD.
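For anyone curious, the footnote-[1] trick is just ordinary LVM plumbing; roughly, a live disk swap looks like this (a sketch with made-up device and VG names, and you'd obviously want more error handling):

```python
#!/usr/bin/env python3
"""Rough sketch of the footnote-[1] migration: move a btrfs-on-LVM
volume's extents from an old PV to a new one while the filesystem
stays mounted. Device names and the VG name are made up."""
import subprocess

VG = "vg0"              # volume group holding the btrfs LV (hypothetical)
OLD_PV = "/dev/sdb1"    # disk being retired (hypothetical)
NEW_PV = "/dev/sdc1"    # replacement, e.g. attached via USB or NBD (hypothetical)

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("pvcreate", NEW_PV)        # label the new disk as an LVM physical volume
run("vgextend", VG, NEW_PV)    # add it to the volume group
run("pvmove", OLD_PV, NEW_PV)  # migrate live extents; the FS stays mounted
run("vgreduce", VG, OLD_PV)    # remove the old PV from the volume group
run("pvremove", OLD_PV)        # wipe the LVM label off the old disk
```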
>At $DAYJOB we use btrfs as the filesystem for our CI workers and have been doing so for years. Its snapshot functionality makes creating the containers for CI jobs instantaneous, and we've had zero problems with it.
I think that the author may not have experienced these sorts of errors before.
Yes, the average person may not care about experiencing a couple of bit flips per year and losing the odd pixel or block of a JPEG, but they will care if some cable somewhere or transfer or bad RAM chip or whatever else manages to destroy a significant amount of data before they notice it.
I was young and only had a desktop, so all my data was there.
So I purchased a 300GB external USB drive to use for periodic backups. It was all manual copying/pasting of files across with no real schedule, but it was fine for the time and life was good.
Over time my data grew and the 300GB drive wasn't large enough to store it all. For a while some of it wasn't backed up (I was young with much less disposable income).
Eventually I purchased a 500GB drive.
But what I didn't know is my desktop drive was dying. Bits were flipping, a lot of them.
So when I did my first backup with the new drive I copied all my data off my desktop along with the corruption.
It was months before I realised a huge amount of my files were corrupted. By that point I'd wiped the old backup drive to give to my Mum to do her own backups. My data was long gone.
Once I discovered ZFS I jumped on it. It was the exact thing that would have prevented this because I could have detected the corruption when I purchased the new backup drive and did the initial backup to it.
(I made up the drive sizes because I can't remember, but the ratios will be about right).
There’s something disturbing about the idea of silent data loss; it totally undermines the peace of mind of having backups. ZFS is good, but you can also just run rsync periodically with the checksum and dry-run args and check the output for diffs.
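A minimal sketch of that rsync check (paths are placeholders; -c compares by checksum, -n makes it a dry run, -i prints what differs):

```python
#!/usr/bin/env python3
"""Checksum dry-run comparison between source and backup via rsync.
Paths are placeholders."""
import subprocess
import sys

SRC = "/srv/data/"
DST = "/mnt/backup/data/"

# -r recurse, -c compare by checksum, -n dry run, -i itemize differences
result = subprocess.run(
    ["rsync", "-rcni", "--delete", SRC, DST],
    capture_output=True, text=True, check=True,
)

diffs = [line for line in result.stdout.splitlines() if line.strip()]
if diffs:
    print(f"{len(diffs)} difference(s) between source and backup:")
    print("\n".join(diffs))
    sys.exit(1)
print("source and backup match by checksum")
```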
It happens all the time. Have a plan, perform fire drills. It's a lot of time and money, but there's no equivalent feeling to unfucking yourself quite like being able to get your lost, fragile data back.
The challenge with silent data loss is your backups will eventually not have the data either - it will just be gone, silently.
After having that happen a few times (pre-ZFS), I started running periodic find | md5sum > log.txt type jobs and keeping archives.
It’s caught more than a few problems over the years, and allows manual double-checking even when using things like ZFS. In particular, some tools/settings just aren’t sane to use to copy large data sets, and I only discovered that when… some of it didn’t make it to its destination.
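The same habit in a small Python sketch (paths are placeholders): checksum the tree, diff against the previous log, and archive today's log.

```python
#!/usr/bin/env python3
"""Write a dated md5 log of a data tree and flag files whose checksum
changed since the previous run. Paths are placeholders."""
import hashlib
from datetime import date
from pathlib import Path

DATA = Path("/srv/data")
LOGS = Path("/srv/checksum-logs")

def md5_of(path: Path) -> str:
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

current = {
    str(p.relative_to(DATA)): md5_of(p)
    for p in sorted(DATA.rglob("*")) if p.is_file()
}

LOGS.mkdir(exist_ok=True)
previous_logs = sorted(LOGS.glob("*.txt"))
if previous_logs:
    previous = {}
    for line in previous_logs[-1].read_text().splitlines():
        digest, name = line.split("  ", 1)   # md5sum-style "<digest>  <path>"
        previous[name] = digest
    for name, digest in current.items():
        if name in previous and previous[name] != digest:
            print(f"CHANGED: {name}")

# Archive today's log for the next comparison.
(LOGS / f"{date.today()}.txt").write_text(
    "".join(f"{digest}  {name}\n" for name, digest in current.items())
)
```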
> Data integrity is the natural expectation humans have from computers
I've said it once, and I'll say it again: the only reason ZFS isn't the norm is because we all once lived through a primordial era when it didn't exist. No serious person designing a filesystem today would say it's okay to misplace your data.
Not long ago, on this forum, someone told me that ZFS is only good because it had no competitors in its space. Which is kind of like saying the heavyweight champ is only good because no one else could compete.
To paraphrase, "ZFS is the worst filesystem, except for all those other filesystems that have been tried from time to time."
It's far from perfect, but it has no peers.
I spent many years stubbornly using btrfs and lost data multiple times. Never once did the redundancy I had supposedly configured actually do anything to help me. ZFS has identified corruption caused by bad memory and a bad CPU and let me know immediately which files were damaged.
> No serious person designing a filesystem today would say it's okay to misplace your data.
Former LimeWire developer here... the LimeWire splash screen at startup was due to experiences with silent data corruption. We got some impossible bug reports, so we created a stub executable that would show a splash screen while computing the SHA-1 checksums of the actual application DLLs and JARs. Once everything checked out, that stub would use Java reflection to start the actual application. After moving to that, those impossible bug reports stopped happening. With 60 million simultaneous users, there were always some of them with silent disk corruption that they would blame on LimeWire.
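The real stub was Java and used reflection, but the idea is simple enough to sketch in a few lines of Python with made-up file names and digests:

```python
#!/usr/bin/env python3
"""Sketch of the splash-screen integrity stub: verify SHA-1s of the
application's files against a known-good manifest before handing
control to the real entry point. File names, digests, and the module
name are made up; the actual stub was Java and used reflection."""
import hashlib
import runpy
import sys
from pathlib import Path

MANIFEST = {
    "app/core.py": "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12",
    "app/ui.py": "de9f2c7fd25e1b3afad3e85a0bd17d9b100db4b3",
}

def sha1_of(path: Path) -> str:
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for name, expected in MANIFEST.items():
    actual = sha1_of(Path(name))
    if actual != expected:
        sys.exit(f"{name} is corrupt on disk (expected {expected}, "
                 f"got {actual}); refusing to start")

# Everything checked out: start the real application.
runpy.run_module("app.main", run_name="__main__")
```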
When Microsoft was offering free Win7 pre-release install ISOs for download, I was having install issues. I didn't want to get my ISO illegally, so I found a torrent of the ISO and wrote a Python script to download the ISO from Microsoft but use the torrent file to verify chunks and re-download any corrupted ones. Something was very wrong on some device between my desktop and Microsoft's servers, but the script eventually got a non-corrupted ISO.
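The script was basically this shape (a reconstruction from memory, not the original; it assumes a complete but possibly corrupt local copy, the URL, piece size, and hashes are placeholders, and parsing the .torrent's bencode is left out):

```python
#!/usr/bin/env python3
"""Check each fixed-size piece of a local download against the SHA-1s
a .torrent carries and re-fetch only the bad byte ranges over HTTP.
URL, piece size, and hashes are placeholders."""
import hashlib
import urllib.request

URL = "https://example.com/win7.iso"   # placeholder download URL
PIECE_LEN = 4 * 1024 * 1024            # placeholder piece size from the torrent
PIECE_SHA1: list[str] = []             # per-piece hex digests from the torrent
MAX_RETRIES = 5

def fetch_range(start: int, length: int) -> bytes:
    """Re-download one byte range using an HTTP Range request."""
    req = urllib.request.Request(
        URL, headers={"Range": f"bytes={start}-{start + length - 1}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

with open("win7.iso", "r+b") as iso:
    for i, expected in enumerate(PIECE_SHA1):
        start = i * PIECE_LEN
        iso.seek(start)
        piece = iso.read(PIECE_LEN)
        for attempt in range(MAX_RETRIES + 1):
            if hashlib.sha1(piece).hexdigest() == expected:
                break
            if attempt == MAX_RETRIES:
                raise RuntimeError(f"piece {i} still corrupt after {MAX_RETRIES} re-downloads")
            piece = fetch_range(start, len(piece))
            iso.seek(start)
            iso.write(piece)
```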
It annoys me to no end that ECC isn't the norm for all devices with more than 1 GB of RAM. Silent bit flips are just not okay.
Edit: side note: it's interesting to see the number of complaints I still see from people who blame hard drive failures on LimeWire stressing their drives. From very early on, LimeWire allowed bandwidth limiting, which I used to keep heat down on machines that didn't cool their drives properly. Beyond heat issues that I would blame on machine vendors, failures from write volume I would lay at the feet of drive manufacturers.
Though, I'm biased. Any blame for drive wear that didn't fall on either the drive manufacturers or the filesystem implementers not dealing well with random writes would probably fall at my feet. I'm the one who implemented randomized chunk order downloading in order to rapidly increase availability of rare content, which would increase the number of hard drive head seeks on non-log-based filesystems. I always intended to go back and (1) use sequential downloads if tens of copies of the file were in the swarm, to reduce hard drive seeks and (2) implement randomized downloading of rarest chunks first, rather than the naive randomization in the initial implementation. I say naive, but the initial implementation did have some logic to randomize chunk download order in a way to reduce the size of the messages that swarms used to advertise which peers had which chunks. As it turns out, there were always more pressing things to implement and the initial implementation was good enough.
(Though, really, all read-write filesystems should be copy-on-write log-based, at least for recent writes, maybe having some background process using a count-min-sketch to estimate locality for frequently read data and optimize read locality for rarely changing data that's also frequently read.)
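(A count-min sketch, for anyone unfamiliar, is just a small fixed-size table of counters that over-estimates frequencies by a bounded amount; a toy version, with arbitrary sizes and hashing:)

```python
"""Toy count-min sketch of the kind such a background process could
use to cheaply estimate how often an extent is read. Width, depth,
and the hashing scheme are arbitrary choices for illustration."""
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key: str):
        # One independent-ish hash per row, derived by salting the key.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # Never under-estimates; taking the minimum across rows bounds the error.
        return min(self.table[row][col] for row, col in self._buckets(key))

# e.g. cms = CountMinSketch(); cms.add("extent:0x7f3a00"); cms.estimate("extent:0x7f3a00")
```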
Edit: Also, it's really a shame that TCP over IPv6 doesn't use CRC-32C (to intentionally use a different CRC polynomial than Ethernet, to catch more error patterns) to end-to-end checksum data in each packet. Yes, it's a layering abstraction violation, but IPv6 was a convenient point to introduce a needed change. On the gripping hand, it's probably best in the big picture to raise flow control, corruption/loss detection, retransmission (and add forward error correction) in libraries at the application layer (a la QUIC, etc.) and move everything to UDP. I was working on Google's indexing system infra when they switched transatlantic search index distribution from multiple parallel transatlantic TCP streams to reserving dedicated bandwidth from the routers and blasting UDP using rateless forward error codes. Provided that everyone is implementing responsible (read TCP-compatible) flow control, it's really good to have the rapid evolution possible by just using UDP and raising other concerns to libraries at the application layer. (N parallel TCP streams are useful because they typically don't simultaneously hit exponential backoff, so for long-fat networks, you get both higher utilization and lower variance than a single TCP stream at N times the bandwidth.)
It sounds like a fun comp sci exercise to optimise the algo for randomised block download to reduce disk operations but maintain resilience. Presumably it would vary significantly by disk cache sizes.
It's not my field, but my impression is that it would be equally resilient to just randomise the start block (adjusting the spacing of start blocks according to user bandwidth?), then let users run through the download serially, maybe stopping when they hit blocks that have multiple sources and then skipping to a new start block?
It's kinda mindboggling to me to think of all the processes that go into a 'simple' torrent download at the logical level.
If AIs get good enough before I die then asking it to create simulations on silly things like this will probably keep me happy for all my spare time!
For the completely randomized algorithm, my initial prototype was to always download the first block if available. After that, if fewer than 4 extents (continuous ranges of available bytes) were downloaded locally, randomly choose any available block. (So, we first get the initial block, and 3 random blocks.) If 4 or more extents were available locally, then always try the block after the last downloaded block, if available. (This is to minimize disk seeks.) If the next block isn't available, then the first fallback was to check the list of available blocks against the list of next blocks for all extents available locally, and randomly choose one of those. (This is to choose a block that hopefully can be the start of a bunch of sequential downloads, again minimizing disk seeks.) If the first fallback wasn't available, then the second fallback was to compute the same thing, except for the blocks before the locally available extents rather than the blocks after. (This is to avoid increasing the number of locally available extents if possible.) If the second fallback wasn't available, then the final fallback was to randomly uniformly pick one of the available blocks.
Trying to extend locally available extents if possible was desirable because peers advertised block availability as pairs of <offset, length>, so minimizing the number of extents minimized network message sizes.
This initial prototype algorithm (1) minimized disk seeks (after the initial phase of getting the first block and 3 other random blocks) by always downloading the block after the previous download, if possible. (2) Minimized network message size for advertising available extents by extending existing extents if possible.
Unfortunately, in simulation this initial prototype algorithm biased availability of blocks in rare files, biasing in favor of blocks toward the end of the file. Any bias is bad for rapidly spreading rare content, and bias in favor of the end of the file is particularly bad for audio and video file types where people like to start listening/watching while the file is still being downloaded.
Instead, the algorithm in the initial production implementation was to first check the file extension against a list of extensions likely to be accessed by the user while still downloading (mp3, ogg, mpeg, avi, wma, asf, etc.).
For the case where the file extension indicates the user is unlikely to access the content until the download is finished (the general case algorithm), look at the number of extents (continuous ranges of bytes the user already has). If the number of extents is less than 4, pick any block randomly from the list of blocks that peers were offering for download. If there are 4 or more extents available locally, for each end of each extent available locally, check the block before it and the block after it to see if they're available for download from peers. If this list of available adjacent blocks is non-empty, then randomly choose one of those adjacent blocks for download. If the list of available adjacent blocks is empty, then uniformly randomly choose one of the blocks available from peers.
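In pseudocode-ish Python, the general case just described looks roughly like this (a sketch, not the actual LimeWire code):

```python
"""Sketch of the general-case block selection described above (not
the actual LimeWire implementation). `have` is the set of block
indices already downloaded; `offered` is what peers are advertising."""
import random

def extents(have: set[int]) -> list[tuple[int, int]]:
    """Collapse a set of block indices into contiguous (start, end) runs."""
    runs, start, prev = [], None, None
    for b in sorted(have):
        if start is None:
            start = prev = b
        elif b == prev + 1:
            prev = b
        else:
            runs.append((start, prev))
            start = prev = b
    if start is not None:
        runs.append((start, prev))
    return runs

def choose_block(have: set[int], offered: set[int]) -> int | None:
    candidates = offered - have
    if not candidates:
        return None
    runs = extents(have)
    if len(runs) < 4:
        # Fewer than 4 local extents: pick any offered block at random.
        return random.choice(sorted(candidates))
    # Otherwise prefer blocks adjacent to the ends of existing extents,
    # which keeps the extent count (and the availability messages) small.
    adjacent = {b for start, end in runs for b in (start - 1, end + 1)} & candidates
    return random.choice(sorted(adjacent or candidates))
```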
In the case of file types likely to be viewed while being downloaded, it would download from the front of the file until the download was 50% complete, and then randomly either download the first needed block, or else use the previously described algorithm, with the probability of using the previous (randomized) algorithm increasing as the percentage of the download completed increased. There was also some logic to get the last few chunks of files very early in the download for file formats that required information from a file footer in order to start using them (IIRC, ASF and/or WMA relied on footer information to start playing).
Internally, there was also logic to check if a chunk was corrupted (using a Merkle tree using the Tiger hash algorithm). We would ignore the corrupted chunks when calculating the percentage completed, but would remove corrupted chunks from the list of blocks we needed to download, unless such removal resulted in an empty list of blocks needed for download. In this way, we would avoid re-downloading corrupted blocks unless we had nothing else to do. This would avoid the case where one peer had a corrupted block and we just kept re-requesting the same corrupted block from the peer as soon as we detected corruption. There was some logic to alert the user if too many corrupted blocks were detected and give the user options to stop the download early and delete it, or else to keep downloading it and just live with a corrupted file. I felt there should have been a third option to keep downloading until a full-but-corrupt download was had, retry downloading every corrupt block once, and then re-prompt the user if the file was still corrupt. However, this option would have resulted in more wasted bandwidth and likely resulted in more user frustration due to some of them hitting "keep trying" repeatedly instead of just giving up as soon as it was statistically unlikely they were going to get a non-corrupted download. Indefinite retries without prompting the user were a non-starter due to the amount of bandwidth they would waste.
The reason ZFS isn't the norm is because it historically was difficult to set up. Outside of NAS solutions, it's only since Ubuntu 20.04 it has been supported out of the box on any high profile customer facing OS. The reliability of the early versions was also questionable, with high zsys CPU usage and sometimes arcane commands needed to rebuild pools. Anecdotally, I've had to support lots of friends with ZFS issues, never so with other filesystems. The data always comes back, it's just that it needs petting.
Earlier, there used to be a lot of fear around the license, with Torvalds advising against its use, both for that reason and for lack of maintainers. Now I believe that has mostly been ironed out and should be less of an issue.
> The reason ZFS isn't the norm is because it historically was difficult to set up. Outside of NAS solutions, it's only since Ubuntu 20.04 it has been supported out of the box on any high profile customer facing OS.
In this one very narrow sense, we are agreed, if we are talking about ZFS on root on Linux. IMHO it should also have been virtually everywhere else. It should have been in macOS, etc.
However, I think your particular comment may miss the forest for the trees. Yes, ZFS was difficult to set up for Linux, because Linux people disfavored its use (which you do touch upon later).
People sometimes imagine that purely technical considerations govern the technical choices of remote groups. However, I think when people say "all tech is political" in the culture-warring American-politics sense, they may be right, but they are absolutely right in the small-ball open-source-politics sense.
Linux communities were convinced not to include or build ZFS support. Because licensing was a problem. Because btrfs was coming and would be better. Because Linus said ZFS was mostly marketing. So they didn't care to build support. Of course, this was all BS or FUD or NIH, but it was what happened; it wasn't that ZFS had new and different recovery tools, or was less reliable in the arbitrary past. It was because the Linux community engaged in its own (successful) FUD campaign against another FOSS project.
Canonical had a team of lawyers deeply review the license in 2016. It's beyond my legal skills to say whether the conclusion made it more or less of an issue; at least the boundaries should now be clearer, for those who understand these matters better.
How are the memory overheads of ZFS these days? In the old days, I remember balking at the extra memory required to run ZFS on the little ARM board I was using for a NAS.
That was always FUD, more or less. ZFS uses RAM as its primary cache… like every other filesystem, so if you have very little RAM for caching, the performance will degrade… like every other filesystem.
But if you have a single board computer with 1 GB of RAM and several TB of ZFS, will it just be slow, or actually not run? Granted, my use case was abnormal, and I was evaluating in the early days when there were both license and quality concerns with ZFS on Linux. However, my understanding at the time was that it wouldn't actually work to have several TB in a ZFS pool with 1 GB of RAM.
My understanding is that ZFS has its own cache apart from the page cache, and that the minimum cache size scales with the storage size. Did I misunderstand, or is my information outdated?
This. I use it on a tiny backup server with only 1 GB of RAM and a 4 TB HDD pool, it's fine. Only one machine backs up to that server at a time, and they do that at network speed (which is admittedly only 100 Mb/s, but it should go somewhat higher if it had faster network). Restore also runs ok.
Thanks for this. I initially went with xfs back when there were license and quality concerns with zfs on Linux before btrfs was a thing, and moved to btrfs after btrfs was created and matured a bit.
These days, I think I would be happier with zfs and one RAID-Z pool across all of the disks instead of individual btrfs partitions or btrfs on RAID 5.
To give some context: ZFS supports de-duplication, and until fairly recently, the de-duplication data structures had to be resident in memory.
So if you used de-duplication earlier, then yes, you absolutely did need a certain amount of memory per byte stored.
However, there is absolutely no requirement to use de-duplication, and without it the memory requirements are just a small, fairly fixed amount.
It'll store writes in memory until it commits them in a so-called transaction group, so you need to have room for that. But the limits on a transaction group are configurable, so you can lower the defaults.
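On Linux with OpenZFS the relevant knobs show up as module parameters; if I remember the names right, something like this is enough to check (and lower) them:

```python
#!/usr/bin/env python3
"""Peek at the write-buffering limits mentioned above on Linux with
OpenZFS. Parameter names are from memory (zfs_dirty_data_max caps
dirty data per pool, zfs_txg_timeout is the max seconds between
transaction-group commits); check /sys/module/zfs/parameters/ on
your system."""
from pathlib import Path

PARAMS = Path("/sys/module/zfs/parameters")

for name in ("zfs_dirty_data_max", "zfs_txg_timeout"):
    p = PARAMS / name
    print(f"{name} = {p.read_text().strip()}" if p.exists() else f"{name}: not present")

# Lowering them (as root) is just a write to the same files, e.g.:
#   echo $((256 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max
```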
What you need is backup. RAID is not backup, and is not for most home/personal users. I learnt that the hard way. Now my NAS uses simple volumes only; after all, I really don't have many things on it that I can't afford to lose. If something is really important, I keep multiple copies on different drives, plus some offline cold backup. So now, if any of my NAS drives is about to fail, I can just copy out the data and replace the drive, instead of spending weeks trying to rebuild the RAID and ending up with a total loss as multiple drives fail in a row. The funny thing is that, after moving to the simple-volumes approach, I haven't had a drive with even a bad sector since.
Oh, I have backups myself. But the parent is more or less talking about a 71 TiB NAS for residential usage and being able to ignore bit rot; in that context, such a person probably wouldn't have a backup.
Personally, I have long since moved out of raid 5/6 into raid 1 or 10 with versioned backups; at some level of data, raid 5/6 just isn't cutting it anymore in case anything goes slightly wrong.
Yep, I get that. I've been there. My NAS is almost 10 years old now, and there is just over 60 TiB of data on it; there is nothing on it that I can't afford to lose. I don't really have a reason to put a 20-bay NAS at home, so simple volumes turned out to be a better option. Repairing a RAID is no fun. I guess most ordinary home users like me should probably go with simple volumes. The cost and effort required for a RAID just doesn't justify the benefit for most home users.
It was ext4, and I’ve had it happen two different times. In fact, I’ve never seen it happen in a ‘good’, recoverable way.
It triggered a kernel panic in every machine that I mounted it in, and it wasn’t a media issue either. Doing a block level read of the media had zero issues and consistently returned the exact same data the 10 times I did it.
Notably, I had the same thing happen using btrfs due to power issues on a Raspberry Pi (partially corrupted writes resulting in a completely unrecoverable filesystem, despite it being in 2x redundancy mode).
Should it be impossible? Yes. Did it definitely, 100% for sure happen? You bet.
I never actually lost data on ZFS, and I’ve done some terrible things to pools before that took quite a while to unbork, including running it under heavy write load on a machine with known RAM problems and no ECC.
So I can consider myself very lucky and unlucky at the same time.
I had data corruption on a ZFS filesystem that destroyed the whole pool to an unrecoverable state (zfs was segfaulting while trying to import; all the ZFS recovery features were crashing the zfs module and required a reboot).
The lucky part is that this happened just after (something like the next day) I migrated the whole pool to another (bigger) server/pool, so that system was already scheduled for a full disk wipe.
I think this is over the top for standard residential usage.
Make sure you have good continuous backups, and perhaps RAID 1 on your file server (if you have one) to save effort in case a disk fails, and you are more than covered.
I find this thinking difficult to reconcile. When I set up my workstation it usually takes me half a day to sort out an encrypted, mirrored rootfs volume with zfsbootmenu + Linux, but after that it's all set for the next decade. A small price for the peace of mind it affords.
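The pool creation is the only part I'd call fiddly; a rough sketch of just that piece (disk paths and the pool name are placeholders, and the zfsbootmenu, dataset layout, and EFI steps are omitted):

```python
#!/usr/bin/env python3
"""Sketch of creating an encrypted, mirrored root pool of the kind
described above. Disk paths and the pool name are placeholders; the
zfsbootmenu/EFI setup that follows is omitted."""
import subprocess

DISKS = ["/dev/disk/by-id/diskA-part3", "/dev/disk/by-id/diskB-part3"]  # placeholders
POOL = "zroot"

subprocess.run(
    ["zpool", "create",
     "-o", "ashift=12",                  # 4K-sector alignment
     "-O", "encryption=aes-256-gcm",     # native ZFS encryption
     "-O", "keyformat=passphrase",
     "-O", "keylocation=prompt",
     "-O", "compression=lz4",
     "-O", "mountpoint=none",
     POOL, "mirror", *DISKS],
    check=True,
)
```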
Reliable storage is tricky.