Real world SSD wearout (okmeter.io)
352 points by MBCook on Aug 27, 2018 | 146 comments



The place I work got hit HARD by this. A year and a half before I started at this company, they set up 4 SSDs in a RAID 5 config.

One day while working on things I noticed a message on our server indicating a PDR1001 Error (Predictive Failure). So we ordered a new one. The new SSD arrived. We popped in the new one and the RAID started to rebuild.... Lo and behold during that operation Drive 1 threw the same error and the whole thing came crashing down...

We ended up losing the whole array. I had NO idea we had SSDs in the system. I had no idea that no one was monitoring their life.... The moment I saw we had this issue I saw the writing on the wall. 4 SSDs in a RAID 5, all installed at the same time... means all the SSDs end up with approximately the same critical end of life.

All I could do was shake my head at the whole thing... Pretty sure those in charge who set up the array still don't understand why this situation was 100% avoidable...


Besides the SSDs, who in their right mind uses RAID 5 anymore? It's been dead to me for years... since the first time I had to rebuild an r5 of 4tb disks and did the math on the time window for cascades. Also, not to nitpick, but you shouldn't wait to order a replacement that gets shipped in order to rebuild; you should have spares ready to go at all times for all RAID systems. That might have been the difference between a cascade and a normal rebuild.


I'm planning to use RAID5 in an upcoming N/DAS build. Though it's not a server-level setup, so I'm using SnapRAID+MFS, which can tolerate more than 1 drive failure in this case, or at least not lose the entire array.

RAID5 definitely has a purpose, notably small arrays where RAID6 would be wasting space, or if you have nodes to fail over to (i.e., you have 3 nodes, each running RAID 5; if one goes belly up during a rebuild you still have 2 left and you can reprovision the third).

RAID6 would likely be at the limit of most modern implementations so you'd have to jump to (n,n-3) RAIDs or higher. Of course there is always RAID1(0) if you like 50% storage efficiency.


> RAID5 definitely has a purpose, notably small arrays where RAID6 would be wasting space

This doesn't quite make sense to me. Why would the size of the array make any difference as to whether or not the extra drives needed by RAID6 can be characterized as waste?

> or if you have nodes to fail over to

In essence, that's RAID5+1, which makes me wonder: If you have node-level redundancy, why bother with the intra-node redundancy?

> RAID6 would likely be at the limit of most modern implementations so you'd have to jump to (n,n-3) RAIDs or higher.

It's not clear to me what limit you're referring to, but, considering how long RAID6 has been implemented, especially in hardware, and how far computing power has increased since then, such a claim is dubious, if not extraordinary.


> Why would the size of the array make any difference as to whether or not the extra drives needed by RAID6 can be characterized as waste?

RAID risk is about your rebuild failure rate, in this case URE (Unrecoverable Read Error). If one such is likely to occur during your rebuild, that can lead to a lot of problems.

In RAID5, until you hit about 2 TB disks and up to about 4 or 5 disks total, the risk of a URE during a full read of the array is fairly low. Once you go above that, you risk losing data to URE errors.

In RAID6 you essentially multiply the URE rates together, which means you have a much, much lower error floor and you can repair bigger arrays.

In a cost-benefit analysis this means that if your array is small enough, a RAID5 gives you more effective disk space with little additional risk.
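A back-of-the-envelope sketch of that trade-off (my own arithmetic, assuming the 1-per-1e14-bits URE figure quoted on consumer datasheets and treating errors as independent; real drives and real rates vary):

  awk 'BEGIN {
    ure = 1e-14                       # assumed unrecoverable read errors per bit read
    for (tb = 2; tb <= 8; tb *= 2) {
      bits = 3 * tb * 8e12            # a 4-disk RAID5 rebuild reads the 3 surviving <tb> TB drives in full
      printf "4x%dTB RAID5 rebuild: P(>=1 URE) ~ %.0f%%\n", tb, 100 * (1 - exp(-bits * ure))
    }
  }'

That prints roughly 38%, 62% and 85% for 2, 4 and 8 TB drives, which is why drive size dominates the rebuild-risk discussion.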

>If you have node-level redundancy, why bother with the intra-node redundancy?

A failover is still a failover and can reduce performance and it reduces your remaining failover margin. You want to keep your failover rate low, though if you don't particularly care you can use RAID 0 too.

>It's not clear to me what limit you're referring to, but, considering how long RAID6 has been implemented, especially in hardware, and how far computing power has increased since then, such a claim is dubious, if not extraordinary.

This is not a CPU power limit; rather, RAID6 will in the next few years hit the spot where the risk of losing 2 drives during a rebuild, due to the large drive sizes, becomes large (URE is still low IIRC), so at that point you want more redundancy to keep the rebuild risk low.


> RAID risk is about your rebuild failure rate, in this case URE (Unrecoverable Read Error).

This seems to be the keystone to your reasoning, and I'm not sure it's true, in practice. IME and in the literature I've read, modern RAIDs operate assuming a whole-disk failure (both on write and on read). Sub-whole-device errors seem to be an issue of concern limited to ECC in RAM. Do you have pointers to literature about this in the context of RAID?

> A failover is still a failover and can reduce performance and it reduces your remaining failover margin.

I think that means that the node-level redundancy isn't actually equivalent to RAID1-style redundancy (where there's no explicit failover), which is important to mention.

Of course, my question wasn't really "why bother" so much as asking if that architecture isn't far more wasteful than, say, RAID1 (or even just RAID6/RAIDZ3) intra-node and less node-level redundancy.

> You want to keep your failover rate low

I'm pretty sure best practices are the opposite of this. One can only know if the redundant copy actually works by using it, so a system that can fail over more often is better. Ideally, the system, like RAID1, requires no explicit failover and merely load balances across both copies.

> RAID6 will in the next few years hit the spot where the risk of losing 2 drives during a rebuild, due to the large drive sizes, becomes large

You'll need to quantify these, though, since, though drive capacities have continued to grow faster than transfer speeds (for mechanical drives), it's still in the OOM of 30 hours for minimum rebuild time. What's the risk of a second and third drive failure in 300 hours (1/10th the maximum rebuild rate)?


>Do you have pointers to literature about this in the context of RAID?

URE Rate is documented in most handbooks or manuals on harddisks, it's usually about 10^-12, though this is a worst case error rate and -13 or -14 is more realistic. At that rate your hard drive will return a read error somewhere between every terabyte and every hundred terabytes.

A URE can cause the raid controller to think the HDD is defective and crash the array if no parity is left to compensate.

>Of course, my question wasn't really "why bother" so much as asking if that architecture isn't far more wasteful than, say, RAID1 (or even just RAID6/RAIDZ3) intra-node and less node-level redundancy.

Depends on your use case.

>I'm pretty sure best practices are the opposite of this.

Best practice is to test your failover regularly, correct, but this doesn't mean you shouldn't minimize your actual failover rate.

>What's the risk of a second and third drive failure in 300 hours (1/10th the maximum rebuild rate)?

It's not that low, especially since a lot of arrays contain similar enough hard drives that concurrent failure is possible (although you can make it less likely by avoiding similar batches). And again, that URE rate has a risk of marking your array as dead if you get unlucky.


> URE Rate is documented in most handbooks or manuals on harddisks, it's usually about 10^-12, though this is a worst case error rate and -13 or -14 is more realistic.

The kind of literature I'm looking for is based on real world data, since spec sheets are generally marketing documents and therefore, at best, uninformative. This is particularly noticeable when a single number is used for a statistic, as in this situation.

Let's take a current datasheet [1] which actually says 1 sector (unclear if that's 512 or 4k, but let's assume 512) per 10E15, max, a full 3 OOMs higher than what you mentioned. That's 512PB, which, at the highest listed transfer rate, would take almost 545k hours, corresponding to slightly above a 1.6% AFR for worst-case URE-caused failures.
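For anyone checking that arithmetic, a rough sketch (my own numbers, assuming ~260 MB/s sustained and reading around the clock):

  awk 'BEGIN {
    hours = (512e15 / 260e6) / 3600   # time to read 512 PB at ~260 MB/s
    printf "~%.0fk hours per worst-case URE, ~%.1f%% per drive-year of continuous reads\n", hours / 1000, 100 * 8766 / hours
  }'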

Since I don't believe the spec sheet's 0.35% (average) AFR, but since real-world AFRs are in the low single-percent, I'm staying with the conclusion that read errors need not be considered separately from any other underlying cause of drive failure.

> Depends on your use case.

I'm not convinced that "waste" can ever depend on use case, when examined under the narrow lens of redundancy. Regardless, it was a use case you specifically brought up, so the question remains open.

> Best practice is to test your failover regularly, correct

You're misunderstanding my assertion, which wasn't about regularity but frequency. I made no statement as to regularity (and it may even be better, with discrete failovers, to do so irregularly, rather than regularly).

My assertion is that best practice dictates frequent failover.

> It's not that low

> risk of marking your array as dead if you get unlucky.

You still haven't actually quantified the risk. "Unlucky" and "possible" are too hand-wavy to engineer around.

Quantifying the risk, even if it's a very coarse approximation, is a prerequisite to avoiding FUD-based decisions/actions. Usually, the most visible consequence of the latter is the waste alluded to earlier, manifesting as higher cost and/or lower performance. The less visible consequence is misallocated resources/attention from something relatively likely (e.g. human error causing catastrophic data loss) to something much less likely (e.g. single sector read error causing catastrophic data loss).

[1] https://www.seagate.com/www-content/datasheets/pdfs/exos-x-1...


>Let's take a current datasheet [1]

Of a high profile and high quality enterprise disk, unlikely to appear in 99% of RAID setups. Congratulations.

Most RAID setups in the wild consist of low quality enterprise or even desktop platters, high quality setups are rare and expensive.

Esp. considering that lots of consumers have a NAS or MyCloud/eqv. at home, and SMBs don't buy those hard drives either, I don't think this is a valid comparison.

[https://www.seagate.com/files/www-content/product-content/na...]

[https://www.seagate.com/staticfiles/docs/pdf/datasheet/disc/...]

These drives here have an OOM higher error rate on the data sheet and they're still not very common, plus the error rate is given in bits here, shaving off 3 or 5 OOM again; calculating it out gives you a bad bit every 12 TB or so. Depending on how your RAID works, this can ruin the entire stripe or just the sector it gets the hit on.

12TB isn't a lot, and there is a good probability that a 3x4TB RAID5 array will hit such an error during a rebuild, while a 3x2TB RAID5 has roughly a 50/50 chance of it.

>I'm not convinced that "waste" can ever depend on use case

It definitely can. Consider the use case of "office documents" vs "high-performance video storage". With office documents high throughput is not needed, so we can reduce storage costs by using RAID5 or 6 depending on array size. Video storage, for example for CCTV, requires more throughput, and a RAID10 will give you better performance at the cost of some effective storage.

You can of course put your office documents on a RAID10, but that is the wasted space. It is not necessary to maximize the RAID performance, and the low storage needed for office documents means you can use RAID5 on small arrays fairly safely, RAID6 if you need more.

Waste is a matter of efficiency in all domains; in this case it means picking the RAID level that best balances storage efficiency, performance goals, and safety. Picking one that needlessly trades off one for the others is wasteful or even dangerous.

>My assertion is that best practice dictates frequent failover.

I don't think I've ever met a sysadmin who insists that frequent failover due to failure is best practice. Regular testing, yes; regular failure, no. Failure is expensive, testing is not.

>You still haven't actually quantified the risk. "Unlucky" and "possible" are too hand-wavy to engineer around.

They're not really, plus you can easily quantify risks as you have done using datasheets. Though the moment your array isn't a singular batch of exactly the same hard drive, these estimations become a lot harder.

You don't need to quantify the risk exactly; it is sufficient to know the margins of the risk, or even just whether you are going into them, i.e. a risk model.

I don't need to know the exact and perfect probability of a read error on the array, I need to know if the array is likely to experience one (with likely being >1% or any other >X% during a rebuild or during normal operation). This question is more easily answered and doesn't require more than estimations.

This trades cost efficiency for safety margins, meaning that the array has a comfortably low actualized risk of failing. That is what I consider actual engineering.


> Of a high profile and high quality enterprise disk, unlikely to appear in 99% of RAID setups.

> Most RAID setups in the wild consist of low quality enterprise or even desktop platters, high quality setups are rare and expensive

> they're still not very common

More extraordinary claims, requiring extraordinary evidence. Even Barracuda Pro models (which do cost more, but not prohibitively so) have the same read error rate, but I didn't choose that datasheet because it had less data overall.

> Esp. considering that lots of consumers have

I call "red herring" on this, since consumers also aren't going to have the kinds of choices we're discussing, nor read these forums, nor the original article (which is the context for this whole discussion).

> the error rate is given in bits here, shaving off 3 or 5 OOM again

I'm not convinced of that, since the tables look identical between the drives. Maybe it's a sneaky marketing ploy, but maybe not. Ultimately, you need real world data, which you consistently haven't provided.

Absent that, it seems as though you're relying on assumptions, and my original conclusion, based on the data that has been published by the likes of Google and Backblaze, stands.

> Waste is a matter of efficiency in all domains

You seem to have re-defined waste, so I can't really speak to it.

> I don't think I've ever met a sysadmin who insists that frequent failover due to failure is best practice.

I fear you are, again, misunderstanding. I didn't mention failure as a cause, although the term "failover" could lend itself to confusion, with the substring "fail" being in it. I used the term only in the sense of "switchover".

Perhaps I merely misunderstood you originally. You did initially state "A failover is still a failover and can reduce performance"; even assuming you meant failover-due-to-failure, that assertion is questionable in the context of justifying node-level redundancy on top of RAID-level redundancy, if switchover (not due to failure) is engineered to be frequent (or even continuous).

> you can easily quantify risks as you have done using datasheets

I'm not agreeing or disagreeing as to its ease, but I'm asking you to go ahead and perform this, which you assert the ease of, since that seems to be the basis of your point.

(Earlier, I just made some single-disk calculations based on the spec sheet, not array risks.)

> I need to know if the array is likely to experience one (with likely being >1% or any other >X% during a rebuild or during normal operation). This question is more easily answered and doesn't require more than estimations.

Agreed. As I mentioned, coarse estimates (if based on real data) are plenty good enough. However, even a coarse estimate assigns some number to it.

Given your question above, what's the answer, for likely-being->1%, during a 300-hour rebuild? How did you arrive at that answer?


Yeah, any time I'm using a RAID setup (as opposed to a cluster or other more intelligent system like ZFS) I insist that there be at least one hot spare and a cold one ready. And never let anyone convince you a RAID is a backup.


Hot and cold spares are rather wasteful. Just use RAID1 and then use software (e.g. Ceph) to make that redundant across multiple enclosures and, ultimately, geographic zones/DCs.


That strikes me as a bit contradictory, since I would consider RAID1 (plus the redundancy across enclosures) to be the equivalent to even "hotter"[1] spares. Unlike the hot (warm) and cold spares for the RAID5/6, all the drives in the RAID1 are continuously in use, which means they're subject to wear-related failures [2].

If you're getting performance benefits from the RAID1, then the extra drives may not be wasted, but that's a separate topic.

[1] Which I argue is a misnomer. To me, "warm spare" makes more sense, since it's powered up but not actively synced in any way. Some systems, configurably, even spin down such spares.

[2] As I mentioned in https://news.ycombinator.com/item?id=17855632 some failures are power-on/spun-up related, which makes the availability of a truly cold spare even more beneficial.


Depends on your RPO and RTO, tbqh. If you have the budget to do that in software you are either fairly small or able to implement changes without the business yanking your chain, and many are willing to just pay more for their storage layer rather than re-architect existing apps.


As someone looking to back up a few terabytes of photos and videos, where I do have them on a RAID, what constitutes an actual backup?


An actual backup would be a copy of said photos and videos on:

- An external Drive. (which is only connected to copy files)

- An external cloud service.

- Another computer that you have.

You should probably do a couple of these.

I personally back up home computers (using borg) to a home server; that server has a 2.5" 2TB external HDD connected to it (2 other 2.5" external drives are kept outside of the house). A backup of important files from the NAS (including the computer backups) gets copied over to the external drive nightly. Weekly, the drives get rotated.

The really important stuff is also backed up offsite on a daily basis.


My general rule is that there are at least 3 copies in different physical locations, at least 2 of which are not within "tornado distance" of each other.

The way to think about backup and DR is "what would it take to destroy all of this?" and keep making the answer more and more extreme until it is so horrible that if it actually happens you won't care about what you lost.

PS: Also always remember that a backup you don't test isn't actually a backup.


Basically, like others said, there's some kind of physical separation. That way if the machine were to explode or you lose too many drives in the array, you still have a copy of them. RAID protects you from a hardware failure (to some extent); a backup protects you from hardware failure, software failure, and human failure (or it should anyway). The idea is that if the canonical version of things gets destroyed somehow the backup is available to rebuild from, so it can't live on the same machine as the canonical one (except maybe while doing the backup itself).

The other "rule" that many people follow is 2 is 1 and 1 is none. The idea is that anything that isn't backed up isn't protected and can't be relied upon to exist.

A cloud service like backblaze, google drive, amazon cloud drive, etc. is a good secondary backup for a lot of people even if it'll take you a month or two to get your data there to begin with.


> or other more intelligent system like ZFS

I thought ZFS was not incompatible with RAID. How else would you run it, out of curiosity?


When using ZFS on top of external RAID hardware or software, instead of using the built-in RAID-Z feature, many of the unique crash-safety and data-corruption-prevention features provided by ZFS are lost. The best practice is to avoid other RAID mechanisms when possible.
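A minimal sketch of that best practice, handing ZFS the raw devices so RAID-Z (here RAID-Z2) provides the redundancy itself (pool name and device paths are placeholders):

  # no hardware RAID LUN underneath; ZFS owns the disks directly
  zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf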


With the current drive capacities even RAID6 is at its limits (or in fact, already past them).


What superseded RAID 5?


Generally variations on RAID 10.

Disks are large and cheap enough to afford losing 50% of your capacity, and it's much faster while in use and when rebuilding.


RAID6, which provides two drives worth of parity compared to RAID5's single drive worth of parity, is the most common successor, especially in hardware (ASIC) implementations.

Other examples are ZFS's RAID-Z, which can support even more parity for even more resiliency to drive failure.

Contrary to a sibling comment, RAID1+0 does not supersede RAID5, as it has existed at least as long and has always had different trade-offs.

How many drive failures one may need to be able to survive, given historical drive failure rates [1], current drive capacity vs. transfer rates (i.e. minimum rebuild time), and individual parameters (e.g. number of drives per array, acceptable magnitude of performance degradation during rebuild), is left as an exercise to the reader.

As the OC suggested, this number can still be 1 (i.e. RAID5) for SSDs.

[1] optionally including or excluding "black swan" events such as the flooding in Thailand that wiped out those disk factories, rendering an entire "generation" of HDDs, manufactured in haste elsewhere, far less reliable


> since the first time I had to rebuild an r5 of 4tb disks and did the math on the time window for cascades.

Care to elaborate on this? Disk specs? Resilvering time?


Could you dumb it down? Are you saying that because all the SSDs were from the same manufacturer and installed at the same time their chance of collectively wearing out simultaneously was high?


That's true for any type of disks. If you install a disk array using disks from the same lot from the same manufacturer, it's extremely likely that you'll get disk failure at more or less the exact same time. You always want to use mixed lot numbers or disks with significantly different manufacturing dates in your arrays. Some people buy spares from multiple manufacturers, others buy from multiple vendors since it's unlikely to get the same lot that way.

RAID 1 naturally spreads writes across all disks. RAID 5 is designed to spread writes evenly across all disks. Sure, certain applications will do uneven writing, but in general it will be fairly even and this is by design. RAID 3 and RAID 4 were abandoned partially because they used dedicated parity drives, and the result was that the parity drives had much higher write loads and so they failed all the time. This meant that the arrays were more often rebuilding or running with degraded protection.


> You always want to use mixed lot numbers or disks with significantly different manufacturing dates in your arrays.

For the few of us left who know how to do this (or even that it's beneficial to do in the first place), it's becoming less practical to do at scale.

It's hard enough finding someone who will hire me and use any of These Things I Know about operating hardware, instead of just my "automation" skills, as if all problems can be solved in software.


> as if all problems can be solved in software.

Isn't this one true to some extent though? We already have scrubbing to detect/fix bitrot. Why not make it (slightly) purposefully unbalanced, so you can avoid the simultaneous failure. I expect your time is worth more money than making one drive fail a few weeks early.


> Isn't this one true to some extent though?

People who write software for a living tend to think so, but the "some extent" is the real issue. For many classes of problems, merely spending more money on hardware (or procedures/process) is objectively better, but one has to know the alternative exists and be willing to make the comparison. Some problems just have physical limitations.

> Why not make it (slightly) purposefully unbalanced, so you can avoid the simultaneous failure.

I think the short answer is because drive failure is non-deterministic and relatively unlikely in the general case.

> I expect your time is worth more money than making one drive fail a few weeks early.

My (ops) time may or may not be worth more than the software engineering time required to make that drive fail early. (NB that the main cost of the hardware solution isn't necessarily time but is often reduced availability and therefore higher cost of suitable parts).

Your proposal also still ignores the fact that some of these drive failures may simply not be manipulable by software. For example, any failure that's correlated with power-on (or spun-up) time, rather than usage, such as bearing failures, could still happen simultaneously (and affect hot spares, a nightmarish situation).

The tried-and-true engineering solution, which happens to be hardware/process based, actually works, and can be shown to address nearly all known drive failures [1], and has a measurable cost. The same can't be said for a software-only attempt to replace it.

[1] firmware bugs, that do things like returning reads of all-zeros on just-written blocks, being a notable exception.


> I think the short answer is because drive failure is non-deterministic and relatively unlikely in the general case.

If it's deterministic enough to be worth sourcing drives from different batches, why wouldn't it be enough to add small amount of writes on purpose?

> Your proposal also still ignores the fact that some of these drive failures may simply not be manipulable by software. For example, any failure that's correlated with power-on (or spun-up) time, rather than usage, such as bearing failures, could still happen simultaneously (and affect hot spares, a nightmarish situation).

Power cut / spinup / other conditions can be replicated from the OS level as well. I didn't list rather than ignored them. It does sound like a good idea to do those as well, considering it could save you from losing all the drives after a power loss / system crash.


> If it's deterministic enough to be worth sourcing drives from different batches,

I suspect you're using a mistaken premise.

It's worth sourcing from different batches because failures are not deterministic. Instead, we merely have probabilities based on past experiences (usually from vast data generously provided by operators of spindles at huge scale).

> why wouldn't it be enough to add small amount of writes on purpose

Well, it's not enough, because it might only protect against simultaneity of certain failures. It also doesn't actually reduce the potential impact of the failures, merely buying more reaction time. By distributing a single batch of drives across many arrays, even a simultaneous failure is just increased replacement maintenance cost (if that's even the strategy, rather than enough hot spares and abandon-in-place), without the looming data loss. With the software staggering of write amplification, each failure could be the start of a cascade, in which case replacement takes on a time-critical aspect. This replacement emergency ends up being an operational (not software) solution, as well.

My worry would be that the software scheme provides a false sense of security.

Additionally, you may want to quantify what "small amount" is, considering you're suggesting such an algorithm would allow for failure multiple weeks apart. 3 weeks is 2% of 3 years. For an array of 12 drives, does that mean that the 12th drive would need 22% the writes of the 1st drive?

Of course, beyond any performance hit, write amplification for SSDs has other deleterious effects (as per the article). A software solution would have to account for yet another corner case.. or just stop trying to re-invent in software what already has a pretty comprehensive solution in operations.

> Power cut / spinup / other conditions can be replicated from the OS level as well.

Not necessarily, although I suspect that true nearly always on modern equipment. However, that's not what I meant. What I meant was failures that occur more frequently merely with the time the drive has spent powered on (or powered on and spinning). Even if that could be simulated relativistically somehow, that wouldn't be a software solution, either.

Also, adding a "chaos monkey" of the kind that powers down a drive in a running array would both introduce a performance hit that I expect a majority of environments would find unacceptable (more than would find write amplification acceptable) and would introduce additional wear and tear on mechanical drives. The latter may be worth it, but I'd be hard pressed to quantify it. It would be different if limited to hot spares, but that's also of limited utility.

You'd also have to be extremely careful in implementation, as a bug here could make a previously viable array into a data-lost array. If such a technique reveals a drive failure, I'd want it to stop immediately so as to be able to replace it with a different one from a different batch and have enough replacements on hand, in case all the rest suffer the same fate.

> I didn't list rather than ignored them.

Unfortunately, it's impossible to tell the difference in discussions on this topic, because, as I mentioned, so few people have first hand knowledge (or have done the research). Even before "the cloud", there was more mythology than hard data (including about temperature, until Google published data debunking that).


If you are willing to move to Geneva, I believe that CERN or any of the LHC experiments could use your skills.

It may be worth visiting their career page.


Relocation isn't something I'm open to, at this point. (Being in the SF Bay Area, I'm not yet worried that this limits me excessively).

I suppose it's also worth noting that I'm sceptical that any organization that large wouldn't have an equally narrow interest in my skillset.

My goal is to be able to apply as close to the full breadth of what I know and can do as possible, rather than something like specifically avoiding automation or specifically exercising my storage knowledge. For that, startups and other small companies seem best, though, oddly, not lately.


Most definitely yes.

But in addition to this, standard RAID5 does not periodically read the data, so it's actually rather common for issues to only surface on a resilver.

This is why proper maintenance in ZFS is to run ZFS Scrub (basically check every file) once a week.


Once a week sounds extreme unless we're talking about a smallish SSD-only pool. You may wear out your (spinning) disks with scrubs more than you do with real workloads. Also, depending on the pool size it may take days for a scrub to complete.


> You may wear out your (spinning) disks with scrubs more than you do with real workloads.

This seems like an extraordinary claim requiring extraordinary evidence, especially since the notion of wear out is only applicable to SSDs (as shorthand for write endurance).

I certainly believe that mechanical disks, with all those moving parts, can have their failure rates increased by increased use, but it's not safe to assume even something as high as a linearly proportional relationship, considering which parts move when.


That's also proper RAID5/6 maintenance. My main/recent familiarity is with LSI hardware RAID implementation, where they call it a "patrol read". I believe mdraid has checkarray.

I'm not sure if you meant to imply that ZFS is different from standard RAID in this regard, but it doesn't seem as though it is.


It's called scrubbing.


Did you mean to reply to my above question? If so, I'm unclear as to what you're trying to get across.

Is ZFS scrubbing different than the other RAIDs' (scheduled or schedulable) reads of the entire array, other than nomenclature?


We use ZFS, how can I check if we are doing a ZFS Scrub every week?


'zpool status' will tell you the last time a scrub was run, or if one is running currently and information about it. Then check your crons and see that they make sense and match what you see with zpool status.
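If there's no cron entry, a sketch of a weekly scrub job (the pool name is a placeholder; many distros ship an equivalent script or systemd timer already):

  # /etc/cron.d/zfs-scrub: start a scrub of pool "tank" every Sunday at 03:00
  0 3 * * 0  root  /sbin/zpool scrub tank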


Well it is not every week.

> scan: scrub repaired 0 in 0h36m with 0 errors on Sun Aug 12 06:00:50 2018

I don't see any cron jobs though...


Given the speed of that, I'd bet you don't have a huge pool (or if you do, that's a really nice speed). I'd bet someone's doing it manually. That's what I do for my own systems (about once a month) since they're not heavy use (personal and parent's file servers).


There are two nasty things going on with RAID5: a) You can tolerate 2 drive failures, no more. One fewer, really, but if you tolerated one, you can tolerate one more, and no more.

AND b) RAID rebuild causes MASSIVE stress on the remaining drives. Seriously massive stress, beyond what distributed systems or RAID 1/6 based systems do.

This occurs because RAID5 not only rewrites parities, but also has to re-read data from all drives while writing parities to all drives. That's a lot of random access, and spinning drives and larger drives especially (see the coincidence?) dislike that. That tends to cause similar-aged drives to die, and then your RAID is gone.

I'd suppose this is less hard on SSDs than HDDs. But there's still a lot of rewrites going on, and SSDs don't like that either.


a) RAID5 can fault precisely one drive without data loss. Two overlapping errors (e.g. two errors in one stripe) and you're up shit creek without a paddle. This is the generic definition for RAID5, not an implementation detail.

b) What makes you think RAID6 doesn't also incur this? The only difference is that RAID6 also includes a Q parity block in each stripe, so the only thing you get saved from is if you don't need to read the parity on the stripes, you save 1/(N+P-1) IOs per drive.

RAID6 is still going to need to recompute ~2/(N+P) parities (one P, and one Q) for rebuilding a drive over (N+P) stripes; and reconstruct the data for the rest (depending on how P and Q are implemented, they could interweave which they use for reconstructions, but AIUI it's generally more expensive to recompute from Q than P, and R than P or Q, in many instances of this math).

c) Many RAID systems can rebuild starting from the starts of the respective disks and streaming along (or, in recent ZFS's case, coalescing the IOs to be in sequential order groups and issuing them), though certainly not all of them.

The logic "usually" goes that RAID5/RAID6 rebuilds are dangerous because they involve reading all the bits, so to speak, so if you don't have an equivalent of scheduled patrol reads to be sure bits at rest that haven't been read by users haven't gone south, you'll first discover this...during a rebuild, and with RAID5, you're SOL.


This happens with spinning disks as well; you have to read all the disks completely from time to time. Many parts of the disk are only used during a sync, and that is exactly when you don't want to find out they're unreadable.


There’s so much wtf in this. Raid5? No. Also, you have to be sure not to fill the drives up. Creates a pathological wear situation.


> Also, you have to be sure not to fill the drives up. Creates a pathological wear situation.

It's a sliding scale. SSDs intended for use in servers usually have more spare area than client/consumer SSDs, and most of their specifications assume they're full. A lot of enterprise SSDs also include features to allow the user to adjust the usable capacity, and the write endurance rating and warranty will scale to match.


This is a useful datapoint, but I feel the author could learn a thing or two from the Backblaze disk reports. It looks like there are a few use cases that cause high wear on SSDs. I don't see a lot of concrete numbers here. To me, it is more useful to say something like: these SSDs will last 2 years given that their sustained write throughput is X GB/sec on average.

From the SSD torture tests I've seen, it is many petabytes of data that the average SSD must write before getting anywhere near "weared out".


Some care has to be taken when saying a "petabyte of data" with flash media. You don't have to write a petabyte of data to reach a petabyte of SSD writes. The minimum block size that flash can write is an "erase block", which is usually a few megabytes.

This means that the minimum SSD write will be at least a few megabytes, including that few hundred bytes to that logfile. Likewise, every flushed write will round up to the nearest erase block, in size.

So, to write a few petabytes to the drive, you only need a few gigabytes of writes of per-byte-flushed data, since your actual NAND writes are amplified by a few million.

There's cache and many smart algorithms at play to minimize the number of erase block writes, but once you cause a flush (close the file, etc), you're performing a write to flash.

This is also why any embedded system with logging enabled has a very real maximum operating life. This is fun to discover the first time, when all of your customers start saying your product stopped working within the same couple of months.


Some clarifications: the minimum SSD write is much less than the erase block. The erase block is the minimum block size that can be erased. For SSDs, there are 2 key concepts: 1. read granularity < write granularity < erase granularity; 2. cells that have been written to must be erased before they can be written again.

Most SSD vendors will buffer write contents to collect a `write granularity` worth of data to avoid wasting bytes writing padding. This can be hard to do on drives without capacitors to supply backup power in the event of power loss.
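As a toy illustration of why tiny flushed writes still hurt even at page (not erase-block) granularity (the 16 KB page size here is an assumption; it varies by NAND generation):

  awk 'BEGIN {
    page = 16 * 1024                  # assumed NAND page size in bytes
    rec  = 200                        # a small log record flushed straight to disk
    printf "worst-case amplification for unbuffered %d-byte flushes: ~%dx\n", rec, page / rec
  }'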


I stand corrected. Our application was a circular buffer, so all non-cached writes trigger an erase. I suppose this isn't a normal use case, unless your drive is 100% full, so maybe tens of gigs rather than gigs, to write a petabyte. ;)


This is mistaken. Erase block size != minimum write page size, and further, FTLs are now quite sophisticated in how they coalesce and manage writes.


The most common size I'm seeing for erase blocks is 128kB. The only drives I found with erase blocks over 1 MB are the Samsung TLC drives like the 840.


You must be looking at really old info, which is understandable because page and erase block sizes for current NAND is often hard to get. The first generation of Intel/Micron 3D NAND used 16kB pages and 16MB/24MB erase blocks for MLC and TLC.


https://www.heise.de/newsticker/meldung/SSD-Langzeittest-bee... (German) did a long-term test of 12 drives in 2016. The best drive (Samsung SSD 850) managed 9.1PB, which took a year of constant writing; the worst (Crucial's BX200) managed about 200TB. All drives managed multiple times more than specified.


That is before write amplification, which can be pretty high.

The real value depends on block size and the write algorithms employed.


My understanding with flash, having talked to someone who had to write his own wear-leveling code so his hardware could use cheap memory, is that it's not so much the pure throughput that can kill a flash cell, but the write pattern that matters. My memory of the conversation is a bit hazy, but it's something like: in order to update even a single bit in a cell, it needs to be erased and written to, and each cell only has so many writes, statistically, before it's flagged no good. So to destroy a disk, you can strategically flip bits all over your disk, or, with a disk with naive wear leveling, maybe just flip the same bit over and over. Good disks will make this harder with tricks like buffers (hopefully backed by battery/cap in the event of a power outage) and very clever wear-leveling algos, but you can definitely wear out flash memory with a fraction of the data and time you might expect if you have a truly pathological workload.


Flash memory is different from a regular drive, and it's to do with how it writes. You can't alter a single bit, you must rewrite the whole block and those can be anywhere from 256KB to 4MB or more. http://codecapsule.com/2014/02/12/coding-for-ssds-part-2-arc...

Adding new bits is cheap, though, so if you can come up with an append-only strategy for your system and periodically combine those into a new, pristine version you'll get far more life out of your SSD than if you go around randomly flipping bits. Is Redis using memory-mapped files the problem here? Treating a file like memory is all fine and good until your random bit-flipping has to be directly persisted.

I wonder if future SSD firmware implementations will avoid wear by having a small buffer reserved for little post-write patches that can absorb a few small changes before having to commit to rebuilding the whole block.

There's talk of "annealing" as a way of restoring cells to full life: https://www.extremetech.com/computing/142096-self-healing-se...

Has anyone tried reflowing a worn out drive?


Flash must be erased on a page boundary, the size of which varies, but ranges from 512 bytes up. The act of erasing a flash page turns all bits to 1s (0xFF for the entire page). When the flash IC breaks, you can usually tell by doing a blank check after erasing. Instead of being 0xFF all over, some bits will be "stuck" to 0. A common reason for a discrete IC flash part wearing out is that charge pumps are used to raise a low voltage like 3.3 V to 12-15 V which is required for erasing. These charge pumps are touchy and can die first, from what I have heard.


> From the SSD torture tests I've seen, it is many petabytes of data that the average SSD must write before getting anywhere near "weared out".

What those tests don't show, by continuously writing and reading until absolute failure, is the degradation in retention that occurs; a block that has been erased and rewritten past the specified endurance will almost certainly still accept writes, and even read back the same data, but is unlikely to retain that data for the specified length of time (several years, normally); instead, it loses it soon after being written.


> these SSDs will last 2 years given that their sustained write throughput is X GB/sec on average

You are forgetting that Okmeter is a monitoring tool, not a backup warehouse. They are talking about what caused the fast wearout and how you could detect it with their product. Backblaze, on the other hand, is all about data storage, and they have good statistics because of the massive number of disks they use, so they make reports about specific models and how they behave.


> before getting anywhere near "weared out".

Well done you, picking on a non-native English speaker! Very welcoming and inclusive.


I agree with you that people need to respect non-native speakers. But attacking someone doesn't promote respect. Please don't attack others on HN.

https://news.ycombinator.com/newsguidelines.html


Is the SMART attribute "media wearout indicator" only available on enterprise-grade SSDs and not on consumer-grade SSDs? I checked my Samsung SSD as well as my NVMe drives and I didn't find it in CrystalDiskInfo.

EDIT: It looks like the "wear leveling count" attribute is available on Intel (#233) and some high-end Samsungs (#177), as well as other SSDs, but it's not on any of my SSDs.


If you're using smartmontools, try running `update-smart-drivedb` first: those fancy attribute names don't come from drives themselves but are always interpreted from (model, firmware, attribute id) tuple by the software you're using to query disks.

Or you can just look for [0] directly if you just want to know what drives Media_Wearout_Indicator is currently defined for.

[0] https://raw.githubusercontent.com/mirror/smartmontools/maste...


I have both "177 Wear_Leveling_Count" and "233 Media_Wearout_Indicator" on an old Samsung SATA II consumer SSD. On a newer one, there's only Wear_Leveling_Count, but there's also "241 Total_LBAs_Written", which gives an interesting metric:

  $ sudo smartctl -A /dev/sda | awk '/177/ {print $2,$4} /241/ {printf "%.3f\n", $10 * 512 / 1024^3}'
  Wear_Leveling_Count 097
  4151.677
Just 4.1TiB written so far (the 4151.677 above is in GiB) causes a 3% decrease in drive "endurance".


Pretty much all modern SSDs support an attribute called "Lifetime remaining", which is basically the same thing as the wearout indicator.

It's attribute 231 for many vendors, 169 for several others, 202 for Crucials and Micron.


"Media wearout indicator" was recorded for OCZ drives, IIRC.


You can enable it. I’ll dig through my notes for the command.


Is there a recommended generalized tool (non-vendor specific) that can be reliably used to determine your SSD's health and outlook?

(Unfortunately I have a mix of SSDs in use, even in single personal computer, and all the vendor individual software gets overly complex and bloated...)


Install smartmontools, then configure it to do a SHORT self test every night, and a LONG self test every week. You can do the same on Windows, but I don't know the tool name. The self tests are built in to the drives and do not require OS support, just a way to start them and read results.

Better make sure the tool has a way to contact you if it finds an error! (Email by default - and test it!)

The long test reads the entire disk looking for errors.
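For reference, a minimal /etc/smartd.conf line along those lines (a sketch: the -s regex schedules a short self-test nightly at 02:00 and a long one Saturdays at 03:00, and -m mails root on trouble):

  # monitor all attributes, enable offline testing and attribute autosave,
  # run scheduled self-tests, and email root when something looks bad
  DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root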

Next, if you are using a Linux MD Raid array install mdadm and make sure it's configured to run a checkarray command (by default every month).


When I open smartmontools it says Number of Reported Uncorrectable Errors is 1,402. Is that an issue? (the short test didn't show any problems)


Yes, that means your hard disk is bad. Sorry :(

A short self test reads from random locations on the disk, it may or may not find anything, it's just a quick test.

Normally your hard disk will reallocate those bad sectors to other places, so you may not have any missing data, but once those bad sectors start they tend to keep going up. At some point your hard disk will run out of alternative places to store data.

But worse, any data in one of those bad locations is lost.

I would get your data off of the hard disk NOW. Then run a long self test and see if the numbers go up.

If your hard disk is still under warranty you should also go ahead and submit a claim.

Caveat: Some hard disks will create a bad sector if you power cycle it right in the middle of writing data. So a couple of those is not an indication of a problem. But 1,402 is rather a lot.

Post the full output of smartctl -a /dev/sda (or sdb, etc, whatever it is) and I can see if I can glean any other info for you.



And CrystalDiskInfo can be used as a CLI as well if you want to automate the monitoring.


smartmontools is a good starting point.

On Windows there is Hard Disk Sentinel [1] and Diskovery [2]. CrystalDiskInfo is hit-and-miss; it's too prone to misinterpreting attributes and showing nonsense.

[1] https://www.hdsentinel.com/

[2] https://diskovery.io/


On unix-likes, I believe SMART tools works?


smartmontools also work great on Windows, OS X, ...


Yes they do, see smartmontools.


So I'm confused. I get the wear out level not going lower than 1%, but that doesn't mean the drive failed, right?


That's very dependent on the drive vendor, model, and even firmware version. In many cases, drives can go on working for a very long time after the SMART metrics show it has failed. Whether you want to take the "risk" depends a lot on how critical this particular device is in your infrastructure.


I have a 16GB SSD that used to be in a POS terminal before I bought it second hand. The SSD Wear Leveling Count attribute shows as failed. Still works.

I have chosen to use it as part of a RAID1 for an OS install on a home server. Waiting to see how long it will last.


It's kinda like the expiration date for foods, an estimate.


I think technically it might still work (i.e. you could read/write to it without errors?), but the integrity of the data it is storing would be questionable.


Doesn't it only mean that writes will fail?


Some SSDs go into read-only mode when they have used up their endurance but others will let you keep writing.


It should but due to crappy firmware many SSDs will just drop off the bus and become unreadable.


I would think it would mean that your writes may be corrupted.


In production tests I've seen the synthetic "wear out level" go down to 1% and the machine still writing Petabytes of data without any issues. That being said, I would only do that on machines that are mostly disposable :)


Worth pointing out: good manufacturers like Micron/Crucial publish endurance figures for their drives. Endurance is the number of bytes you can write before the drive is officially worn out. Endurance numbers can vary by two orders of magnitude.

So if you have a write-intensive app like a Jenkins server building a C++ project, you might want to pony up for a server-class drive with better endurance.


>building a C++ project

Build in a ram disk and then copy whatever you need for permanent storage?


not everyone has 64GB of ram.


64GB is cheaper than enterprise SSD.


As the author, I would greatly appreciate any suggestions on what kind of stats you would want us to gather and share.


Just FYI, "to wear (out)" has irregular past tense and past participle forms: "wore" and "worn". So:

1) These disks, because of constant throughput, wore out.

2) These disks, because of constant throughput, are worn out.


thnx


don't just trust the drive's SMART attributes - try to independently confirm when and how the drive is failing

techreport did a long-term experiment on this (but with limited samples): https://techreport.com/review/27062/the-ssd-endurance-experi...


This article seems to have a trove of interesting data, but struggles to generalize many conclusions out of it.


I think that's because it's a thinly veiled advertisement.


I'm the author. We wanted to showcase some scenarios and that's it.

I would love to hear any suggestions of what else we can gather and report.


Sure... so, showing "here is a weird disk pattern -- they were running X on top of it -- consider not running X on SSDs" with a sample set of 1 is a logical fallacy and kind of a bizarre post.

For small sample sets, going deep to understand unnecessary writes, tuning the clients, and showing less SSD wear after tuning would be interesting. Or, assuming you have more than 1 client in each of these situations, aggregating the data to show patterns would be far more useful. As has been mentioned elsewhere, for inspiration, Backblaze has really nice posts analyzing their device wear.


Thanks. That makes sense.

But while we do have a lot of clients, I really think that all of their setups are unique in some way. So, starting from that, we didn't find more Redises on Ceph dumping a lot.

> tuning the clients and showing less SSD wear after tuning would be interesting.

That's just not an option for us, as we only do monitoring and don't have any way of tuning.


The advice to consider disabling swap applies equally to (micro)SD cards and USB sticks. Just make sure you have enough RAM. Also, consider mounting directories such as /var/log as tmpfs (if you don't need the logs after a reboot) or use a remote syslog.
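A sketch of the tmpfs approach (the size is illustrative; anything under /var/log is gone after a reboot or unmount):

  # keep logs in RAM; add a matching tmpfs line to /etc/fstab to make it permanent
  mount -t tmpfs -o size=128m,noatime tmpfs /var/log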


What really happens when the level is low? Less capacity? Data loss? Performance decreases?


I've worked on SSD firmware before. Obviously the implementations may vary, but typically there is no hard failure. These are just projections based on datasheet endurance specs. What you need to look at additionally is how many defect blocks there are (another SMART attribute), as it will start to rapidly increase as more and more flash blocks go bad.

When the level is low, capacity shouldn't decrease, as there are spare blocks. There should be no data loss except from catastrophic block failures (some SSDs have internal RAID-like redundancy). There might be a slight performance decrease as the error correction algorithms become more active.


Those are firmware level error corrections? Or like, driver?


It's in the controller: a combination of hardware and firmware. A large percentage of them will be handled by the hardware, but a small number require firmware intervention.


SSDs have their own error correction and wear-leveling algorithms. In this case they were referring to the device firmware.


They generally stay fairly consistent in performance, near the end you start getting a whole lot of reallocated sectors and then it just suddenly stops working. Depending on the controller it either locks itself to read-only or becomes totally inaccessible.

https://techreport.com/review/27909/the-ssd-endurance-experi...


This varies by SSD.

Some just fail hard, and that ranges from bricking to loss of filesystem integrity.

Others stop allowing writes.

Still others will slow down, or lose capacity.

I have had two die on me after a few years of work involving tons of writes. One cheap SSD just bricked. Gone.

The Samsung I had started doing weird things with capacity.

Personally, I just cycle them when I see a deal out there.

Old ones do need to be powered up, from time to time just like USB thumb drives do, or the contents can be lost.


How does one know how a particular model of SSD will behave? If I had the choice, I'd rather the drive enter a read-only state rather than becoming a brick. So when I'm shopping for a new drive, how do I use this a determining factor in choosing the new drive?

Is it a known thing that Vendor X will brick, Vendor Y will go read-only, or Vendor X Model XXX will brick while Vendor X Model YYY will go read-only? If so, where can one find this level of information on the inner workings of a drive? Most tech reports are solely focused on read/write speeds, as that's what the majority of people are concerned with.

Inquiring minds want to know...


It's going to come down to the firmware and controller. That can even be different when the vendor and model are the same.


Doing that is hard. With older SSDs you can find reviews and other information. With new ones, I consider it a crapshoot.


"I'd rather the drive enter a read-only state rather than becoming a brick"

Totally agree. However, all of the SSDs that have failed on me got bricked. I have never seen one go into read-only mode, which was disappointing.


"Media wearout indicator" is a vendor-specific attribute.

In fact, there are NO standard (or, as the article coyly calls them, "basic") SMART attributes at all. A lot of vendors stick certain attributes into the same slots, granted, but there's literally no industry-wide spec for any of the attributes. You have to go not by vendor, and not even by device, but by firmware revision to accurately interpret an attribute.

The wearout indicator often sits in slot 233, but sometimes it will be in 230, sometimes in some other slot. Moreover, in some drives (e.g. OCZ or ADATA) 233 isn't documented at all and it will grow from 0 normalized.

So this:

> we implemented collection not of all the attributes, but only basic and not vendor-specific ones 

is misleading and inaccurate.


I'd like to request a request for comments.


You don't request one, you write one.


I did a similar thing and found even lower failure rates (across thousands of drives in a high-write database environment). One thing Intel recommended was to overprovision the drives: setting the max LBA at 80% of drive capacity preserves optimal wear leveling. I also used a different drive layout, etc. The long and short of it is that NAND wearout is kinda like quicksand: you grow up expecting it to be a way bigger problem than it actually is in your adult life.
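For SATA drives, one way to cap the visible capacity like that is the ATA host protected area via hdparm. A sketch (the device and sector count are placeholders, and this should only be done on an empty, freshly secure-erased drive):

  hdparm -N /dev/sdX                # show current and native max sector counts
  hdparm -N p1500000000 /dev/sdX    # cap visible capacity; the 'p' prefix makes it persistent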


The redis case is interesting. If the 1 minute interval was somewhat arbitrary you could almost double the life of the drive by just bumping it to 2 minutes.


Yes, depending on their use case for this frequent dump file, they may be happier using AOF or much-less-frequent saves instead. I believe a dump file is belt-and-suspenders secondary backup for almost all users.
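A sketch of what that might look like at runtime (the numbers are illustrative; this relaxes RDB snapshots and leans on the AOF for durability instead):

  redis-cli config set save "300 100"   # snapshot at most every 5 min, and only if >=100 keys changed
  redis-cli config set appendonly yes   # use the append-only file for durability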


I just can't trust a man who doesn't trust his suspenders


Maybe it's the belt they don't trust.


It does very much depend on the "SSD" in question; e.g. SD cards and eMMC don't tolerate nearly as many writes as a SATA SSD or NVMe drive, and for sure the former dislike having power cut. I've had a few name-brand SD cards go permanently read-only as a result: neat that I can still read my data, but also not neat that I can't erase them before sending them back under warranty.


It's interesting how long SSDs tend to last on development boxes.

I've had a 256GB Crucial SSD running on my primary dev box and it's been powered on for 3.6 years. During that time it's been powered on 94 times. I run a ton of Dockerized apps and other things.

The "wear leveling count" is at 163 which according to Crucial means the drive is at about 95% healthy.


This is why Optane storage is amazing. Wears out about ten times slower, and has incredible 4K random read performance on top of it.

It almost feels like Intel is squandering a massive advantage by not pushing it harder. I know Optane DIMMs are coming soon, but M.2 form factor would be nice as well.


What would you use it for yourself? (disclosure: I work at Intel on Optane-related topics.)


The fast 4K random read is handy for delta backups to hard disk; I can read half a million files in about a minute, determine what changed, and send whatever small changes off to disk.

Another great use case is for web/npm projects with a million dependencies, all flat file. Working with those is usually miserable otherwise.

In the future I intend to crunch time series data with it.


For those of you deploying lots of SSDs for work, how often are you having to replace them due to wear out? Many have 5 years warranties with multiple PB written (or drive writes per day), so I wonder how things are in practice.


Wearout is very uncommon. Biggest thing to do is make sure you don’t fill them up. Full drives are unable to do their optimal wear leveling.


Would you do anything different with your storage software or SSD procurement if you didn't have to worry about filling up SSDs or wear leveling? Wondering how big of a pain point it is (disclosure: I work on new SSD tech at Intel).


I accidentally discovered something similar as well: I saw my disk usage was 30TB over the past few weeks on my MacBook, which I thought was abnormal. Most of my work is browsing and editor coding. Turns out I had too many tabs open in Safari, and my Mac was constantly out of memory and swapping. I was doing anywhere from 100GB to a few hundred GB of writes per day.

If anyone has an easy solution to this it would be much appreciated. Right now I am just checking every few days to see if I did something silly that causes lots of paging. There was a time I didn't think I needed 32GB of memory; now I think there is a real need.


>If anyone has an easy solution to this it would be much appreciated.

- use a different browser; Vivaldi supports suspending and lazy tab loading

- add more RAM; in your Apple case that probably means buying a new laptop


On Linux I set vm.min_free_kbytes on my laptops so they crash something instead of swapping until I pull the plug. Maybe XNU has an analogous knob?
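On Linux that knob is a sysctl; a sketch (the value here is illustrative):

  # keep ~1 GB free so the kernel reclaims/OOM-kills early instead of thrashing swap
  sysctl -w vm.min_free_kbytes=1048576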


I read the related article about Postgres SELECTs causing writes sometimes - I've used Postgres for years and never knew this!

Anyway, I spend most of my time in the software world and occasionally have to set up hardware (or more commonly, diagnose hardware issues). Is there a good guide for how to set up things like Postgres or Redis, with recommendations on swap/cache setup, RAID-level selection, etc.?


We had one SSD die in production, but it was on a system whose write load was enormous... like 50-60 MB/sec sustained for over a year.


Out of curiosity, could some of that write load be shifted over to a temporary RAM drive?

What did the server do?


Do you remember what model it was?


Intel I think, but not positive. It was a higher grade SSD. Could have been an isolated incident but it was definitely abused.


[flagged]


Nationalistic attacks are not allowed on HN. Please don't post like this again.


I think you posted this comment to the wrong thread. (Not that it would be any more appropriate of a comment elsewhere.)


Not to support his fairly trashy post in particular, but I believe his comment has utility.

As an infrastructure person this post was concerning; this company is collecting a lot of data and has a lot of access, which I wouldn't trust. I also would not be thrilled if I were a paying customer having these details shared (even without attribution, as in the article), further reducing trust. I appreciate these kinds of real-world detail posts, but it's not appropriate if it's not your infrastructure.


Could you please elaborate on the point? How can we improve?

What potential problems do you see?

Thanks


Sure. First and foremost, do you have permission from the customers you're researching and reporting on here? If you do, great, ignore me. If not, you'd be breaching (my) trust if I were one of them. The data is not yours, and it may be possible to infer who these data points belong to if so desired. If one could do that, they might be able to gain a competitive advantage or otherwise exploit knowledge of the infrastructure (social engineering, for example).

There is a big difference, IMO, in someone like backblaze releasing statistics. They own all of the hardware and they choose to release the data themselves. You (on the surface) appear to be harvesting data from your customers, digging through it, and presenting it. You also point out very specific cases, rather than aggregate pseudonymous data.

You are collecting sensitive data from your customers' environments. This doesn't inspire confidence that you treat it as such.


Their FAQ mentions an on-premises option that keeps data in house. Depending on how that works and whether it's firewalled off, that might mitigate the spyware angle.



