Wow, this is a surprisingly sober look at using ML for this particular problem. I really appreciate how the author walks through every step and the rationale behind it, spending as much time explaining the dataset and cleaning it up as she does on, say, the model she ends up using.
It’s a bit of a breath of fresh air to see a post like this where someone clearly puts the work in to try and solve a given problem with ML and is able to articulate the failure modes of the approach so well.
What was the author's rationale for choosing XGBoost? Failure analysis is an old field, so I'm very surprised that there isn't even a cursory nod to the exponentially distributed failure times that most failure models assume. If anything, this post just spends a lot more time than usual talking about data cleanup.
This is very standard in industry (as opposed to academia) and highlights a lot of the reasons why you need ML specialists operating these models, and why you will get garbage in garbage out if you try having non-specialist engineers build, evaluate & operate these types of systems.
For example, in this post the engineer does not adequately handle class imbalance, and reaches for a hugely complicated model framework (XGBoost) without even trying something like bias-corrected logistic regression first.
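To be concrete about the kind of baseline I mean: a class-weighted logistic regression on the SMART features takes a few lines in scikit-learn. This is just a sketch, not the author's pipeline; the file and column names below are made up for illustration.

    # Hypothetical sketch of a simple baseline: class-weighted logistic regression
    # on daily SMART snapshots. File and column names are made up for illustration.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    df = pd.read_csv("smart_daily.csv")  # assumed: one row per drive-day, binary 'failed' label
    features = ["smart_5_raw", "smart_187_raw", "smart_197_raw", "smart_198_raw"]

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["failed"], test_size=0.2, stratify=df["failed"], random_state=0
    )

    # class_weight="balanced" reweights the rare positive (failed) class instead of
    # ignoring the imbalance; the point is a dumb, interpretable baseline, not a fancy model.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))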
I think this post is just an example of an infra engineer thinking “ML is easy” and when their quick attempt doesn’t work, then just bailing and saying “ML is overhyped and doesn’t work for my problem.”
This is not the correct modeling approach. All hard drives will fail given enough time, so labeling the failed hard drives as the positive class will bias your results.
Stuff like this really should be handled using survival analysis.
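For anyone who hasn't seen it, the survival framing looks roughly like this with lifelines. This is only a sketch of the idea with made-up file and column names, not the article's approach.

    # Sketch of a survival-analysis framing (Cox proportional hazards via lifelines).
    # Assumes one row per drive with an observed lifetime and a censoring indicator;
    # the file and column names are hypothetical.
    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.read_csv("drive_lifetimes.csv")
    # duration: observed power-on days; event: 1 if the drive failed, 0 if still alive (censored)
    cph = CoxPHFitter()
    cph.fit(df[["duration", "event", "smart_5_raw", "smart_187_raw", "temp_c"]],
            duration_col="duration", event_col="event")
    cph.print_summary()  # hazard ratios per covariate

    # Ranking drives by partial hazard gives a "who is likely to fail first" ordering
    # instead of a hard failed / not-failed label.
    risk = cph.predict_partial_hazard(df)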
I agree and have been doing some similar analysis on the Backblaze dataset. I suppose you can use this for prediction, but I'm personally just interested in post-hoc analysis and (1) getting better AFR estimates when failure counts are low + (2) exploring time-dependence of hazard functions with different priors (GP priors, etc.). This post and your comment have motivated me to make a post this weekend! Thanks!
SMART seems to be extremely useless in practice. Manufacturers don't seem to expose actual failure statistics through it, likely for fear of making their product look bad.
SMART is a liar sometimes. I have first hand experience with faulty Seagate firmware and Equallogic SANs - where errant statistics caused disks to be ejected from the volumes before you could finish a rebuild. Nothing like watching 40TB of data disappear on multiple installations over the course of a few weeks!
This is an interesting experiment because if the model is running on the same hard drive it is detecting, then it is looking at itself dying in the mirror
Fitting a binary classifier on failure doesn't seem quite right to me.
There are some quite natural questions you can't answer unless you can predict some statistics on when failure will occur. And if you have any drift in the dataset, you might just find features correlated with age (and hence with having had more time to fail) that don't inherently increase risk.
Although I'm not sure about this case, in general the approach in the article (a tree ensemble used as a binary classifier) works well for many business problems!
Thermal abuse (running at a high temperature for many hours), duty cycle, and I/O rate (high head acceleration versus head dwell) all impact the lifetime.
Early-life media issues are predictive of failure and can be detected through stress testing with various seek ranges. It's not as simple as a SMART metric, but it did appear to have predictive value. The one experiment I ran discovered a set of drives for which the manufacturer ultimately admitted to a since-corrected process problem.
The above factors are different on each drive family and can’t normally be generalized.
The article is fascinating, but one thing I'm finding very common on these engineering blogs is that there's no easy path back to the company or service. Understandably, the logo at the top goes back to the blog homepage, and the copyright at the bottom is just static text.
I could search or visit one of the social pages linked in the footer, but it seems like an obvious link one would want to have.
Hitting the “Join The Team” button up top takes you back to the main site, which seems reasonable since in most cases a technical blog is meant to be a recruitment funnel more than a general public-facing PR document, no?
I'm currently using a non-linear version of it, DeepSurv [*], implemented in pycox, for a predictive maintenance job. These models are much more informative than a binary label and give you room to make a business decision about how close to EOL you want to take care of the asset.
[*] The underlying neural networks aren't deep at all.
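For anyone curious, the setup looks roughly like this. It's a sketch loosely following pycox's documented CoxPH example, using synthetic placeholder data rather than anything from the article, so check the pycox docs for the exact signatures.

    # Rough sketch of a DeepSurv-style model via pycox's CoxPH class, loosely following
    # the package's documented example. The data here is synthetic placeholder noise.
    import numpy as np
    import torchtuples as tt
    from pycox.models import CoxPH

    x_train = np.random.rand(1000, 8).astype("float32")
    durations = np.random.exponential(365.0, 1000).astype("float32")
    events = np.random.binomial(1, 0.1, 1000).astype("float32")
    y_train = (durations, events)  # observed lifetime + failure indicator

    net = tt.practical.MLPVanilla(in_features=8, num_nodes=[32, 32], out_features=1,
                                  batch_norm=True, dropout=0.1, output_bias=False)
    model = CoxPH(net, tt.optim.Adam)
    model.optimizer.set_lr(1e-2)
    model.fit(x_train, y_train, batch_size=256, epochs=20, verbose=False)

    model.compute_baseline_hazards()
    surv = model.predict_surv_df(x_train)  # a survival curve per drive, not a binary label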
I think this write-up is interesting but structurally incorrect, and everyone here is taking away the wrong conclusions.
The punchline here is:
> Unfortunately, once they were scaled up to production levels of use, their tradeoffs turned out to be too much to bear. These models are good—especially given how imbalanced the data for this problem is—but with too many false positives from one and too little predictive power from the other, they’re not worth betting so much money on.
A practical next-step analysis would be some back-of-the-envelope economic math about the necessary detection threshold for detecting and replacing failing drives. Overall, it doesn't make sense to even think about applying ML without understanding the cost/benefit of solving the problem.
Think about Backblaze -- the drives are packed in there tightly and it's super expensive to replace one, whereas there's basically zero cost to leaving one in place. Even if you could perfectly predict which drives would fail, there's no reason you'd act on that info.
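To make that concrete, here's the kind of toy envelope math I mean; every number is invented, the point is just the break-even structure.

    # Toy back-of-the-envelope math: is acting on a "will fail" flag worth it?
    # All costs and model stats below are invented; plug in your own.
    cost_proactive_swap = 25.0    # tech time + spare for replacing a drive before it fails
    cost_surprise_failure = 40.0  # rebuild load, risk window, unscheduled ops work
    precision = 0.30              # fraction of flagged drives that really would have failed

    cost_if_act = cost_proactive_swap                    # we swap every flagged drive
    cost_if_ignore = precision * cost_surprise_failure   # expected cost of ignoring a flag

    print("act" if cost_if_act < cost_if_ignore else "ignore",
          f"(act: {cost_if_act:.2f} vs ignore: {cost_if_ignore:.2f} per flagged drive)")

    # Acting only pays off when precision > cost_proactive_swap / cost_surprise_failure.
    print("break-even precision:", cost_proactive_swap / cost_surprise_failure)

With those made-up numbers, acting on a flag only pays off once precision clears roughly 62%, which is the sort of bar you'd want to establish before training anything.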
I didn't read your linked study, but the blog post threw out the actual data. It's not surprising you get nothing of use when you start with:
> Each attribute also has a feature named raw_value, but this is discarded due to inconsistent reporting standards between drive manufacturers.
Sure, raw_value is inconsistent between manufacturers and sometimes models, but it's the most real data.
Edit to add: I reread the Google survey, which says, more or less, that bad sector counts were indicative of failure in about half the cases; that was in 2007.
One thing to note is that drives made now may perform differently than those in the 2007 survey. Different firmware, different recording, better soldering (the 2000s are full of poor soldering because RoHS mandated lead-free solder in most applications, and the lack of experience with lead-free solder led to failing connections).
I found, in 2013ish, with a couple thousand WD enterprise-branded drives, that just looking at the sum of all the sector counts and thresholding that was enough to catch drives that would soon fail. If your fleet is large enough, and you have things set up so a failed disk is not a disaster (which you should!), it's pretty easy to run the collection and figure out your thresholds. IIRC, we would replace at 100+ damaged sectors, or if growth was rapid. Some drives in service before we started collecting SMART data did seem to be working fine at around 1000 damaged sectors, though, so you could make a case for a higher threshold.
SSDs, on the other hand, seemed to fail spectacularly (no longer visible from the OS), with no pre-failure symptoms that I could see. Thankfully the failure rates were much lower than for spinning disks, but they were a lot more of a pain.
Hmm, one could encode manufacturer and maybe even model as a categorical or one-hot encoded variable. Then again, the small number of failed drives in their dataset would probably make this hard.
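Something like this is cheap to try (toy data made up here), though with so few failed drives per model the extra columns may not earn their keep:

    # Minimal sketch: one-hot encode manufacturer/model so a classifier can learn
    # per-vendor SMART quirks. The toy data here is made up.
    import pandas as pd

    df = pd.DataFrame({
        "manufacturer": ["A", "A", "B"],
        "model": ["A-100", "A-200", "B-7"],
        "smart_5_raw": [0, 12, 3],
    })
    encoded = pd.get_dummies(df, columns=["manufacturer", "model"])
    print(encoded.head())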
Do we need actual prediction? I'll settle for a just-in-time warning. This is what SMART parameter C4 gives you. You want to stop using the drive at the first sign of trouble; that's what C4 is good for.
>Only one drive manufacturer had enough failed drives in the dataset to produce a large enough subset, which we’ll refer to as Manufacturer A
Seagate, we will refer to this manufacturer as SEAGATE.
Reallocated sectors is pretty good, but I'd also look at pending and offline uncorrectables to get a potential earlier warning.
I've seen some drives that were apparently working fine with about 1000 reallocated sectors, and the SMART firmware usually allows for quite a few more than that, so if you replace at 100 problem sectors, I would consider that 'predictive' replacement. It's debatable, of course; something that predicted failure based on changes in data transfer (which presumably precede sectors being flagged) would be a bigger predictive step, but I never got that far in modeling. Using flagged sector counts was enough to turn unscheduled drive failures into scheduled drive replacements for me.
If prediction were possible, you could do unethical things, like selling off units bound to fail in the near future without disclosing that information.
I wonder if there are any existing examples of such practices. Supermarket chains sell products near their expiration date at steep discounts; of course, the difference is that you know what you are buying and can still eat a half-price Nutella jar that's 20 days from expiry.
Assuming your prediction is so good that it completely eliminates surprise failures, OK. But if your software can handle surprise failures then it can handle them, and in that case I don't see the value of the prediction. The only valuable thing I can see is if a system could predict that you were going to have widespread, coordinated failure of so many drives that your ability to reconstruct a stripe was at elevated risk. But such a thing is science fiction (as the article demonstrates) and you can approximate it quite well by assuming that all drives of the same make and model with the same in-service date will fail at the same moment.
If your system is designed for hot swap of failed disks, but 1% of the time you need to reboot the system for the new disk to be detected, predictive replacement lets you move traffic off the system at a convenient time to do the swap (when traffic is low and operations staff aren't super busy). Predictive replacement can be deferred more easily, because the system is still working; maybe you can batch it with some other maintenance.
A surprise failure needs to be dealt with in a more timely fashion.
In my experience (which I didn't write a blog post for), monitoring the right SMART values and doing predictive replacement eliminated most of the surprise failures for spinning drives. SSDs had fewer failures, but I wasn't able to find any predictive indicators; they generally just disappeared from the OS's perspective.
You are trading off cost, though. According to that paper I referenced, the SMART reallocated sector count is just binary (the critical threshold is 1 sector). Drives with non-zero counts are much more likely to fail but 85% of them are still in service a year later. If you proactively replace them you may indeed have avoided some surprise failures but it's actually not a great deal more effective than replacing drives randomly, and it costs money.
Also we haven't even discussed the online disk repair phenomenon, in which you take a "failed" disk, perform whatever lowest-level format routine the manufacturer has furnished, and get another 5 years of service out of it. This is done without ever touching it.
As with many tasks assigned by stubborn executives, some things exist only as a monument to their own failure. Suuuuuuuuure boss, we will build SMART exactly how you wanted... to show you why it is a dumb idea.
I'm not surprised by the conclusions; I've never seen anyone claim SMART has been useful unless the HDD is already known to have issues that are noticeable without SMART, in which case SMART just makes it possible to diagnose the actual problem. Though, of course, this is limited by my own anecdotal research and personal experience.
What I wonder, then, is whether sensitive, isolated microphones could be tried for this purpose. We already know that sound (i.e. yelling at the drive) can vibrate the platters enough to cause performance degradation. If there were internal mics in each HDD recording sound as the drive spins, and that audio were correlated with HDD activity, could it also be correlated with failure rate?
If you have spinning drives, monitor the (raw) sector count stats. Add together reallocated sectors, uncorrectable and offline uncorrectable.
If you have good operations, replace when convenient after it hits 100. If you have poor operations (like in home use, with no backups and only occasional SMART checks), replace if it hits 10, and try to run a full SMART surface scan before using a new drive.
For SSDs, good luck, I haven't seen prefailure indicators.
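If anyone wants to automate the spinning-drive check above, here's a rough sketch. SMART attribute names vary by vendor and firmware, so treat the watched list as a starting point rather than gospel.

    # Rough sketch of the sum-and-threshold check described above, parsing plain
    # `smartctl -A` output. Attribute names vary by vendor/firmware.
    import subprocess

    WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
               "Offline_Uncorrectable", "Reported_Uncorrect"}
    THRESHOLD = 100  # replace-when-convenient level; use ~10 if you rarely check

    def damaged_sector_count(device: str) -> int:
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=False).stdout
        total = 0
        for line in out.splitlines():
            fields = line.split()
            # attribute rows look like: ID# NAME FLAG VALUE WORST THRESH TYPE ... RAW_VALUE
            if len(fields) >= 10 and fields[1] in WATCHED:
                raw = fields[9]
                if raw.isdigit():
                    total += int(raw)
        return total

    count = damaged_sector_count("/dev/sda")
    status = "replace soon" if count >= THRESHOLD else "ok for now"
    print(f"damaged sectors on /dev/sda: {count} ({status})")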
I've also never found anything meaningful out of SMART at home lab scale. I do know the way vendors report it is a shitshow, so I wouldn't suspect it'd be great training data.
What I'd be really curious to try is run IO benchmarks on the drive and see if there are performance issues that indicate a drive is failing.
I have had drives tell me something was wrong before they developed bad sectors. It does take a lot of monitoring though and probably leads one to replace drives earlier than might be necessary.
SMART Load_Cycle_count told me my WD Red drives were parking heads after a few seconds without use - behaviour you'd expect from WD Green, not WD Red. I hit nearly 100k LCC on all drives in a ZFS pool within a couple of months. When I found the problem I was able to disable the behaviour with idle3ctl.
Just curious: how come the total observations of the confusion matrices are so different? Shouldn't the same data set be used in each trial to fully evaluate the performance differences? Maybe I missed some details in the article...
This was written as a look back after a year-long project, and the target testing set did change a little along the way. The big change was splitting by manufacturer (and eventually by model family), which changed the number of applicable samples to compare performance on.
I cut some stuff from the article for length, so I totally get how that could have been unclear.
For enriching the model, a few ideas for discriminating failing drives:
Some hard drives have vibration sensor(s) in them - I wonder if a daemon could sample them throughout the day and dump a summary of the data when the daily SMART data dump is generated.
The drives report temperature via SMART; I wonder if anything meaningful can be extracted from the temperature differential between the HDD's sensor and a chassis temperature sensor.
I wonder if more discrimination by manufacturing detail (date of manufacture, which factory, etc.) could help?
I was going to suggest drive supply voltages, but those are hard to compare across chassis - calibration of the voltage-measuring circuitry in PSUs isn't too hot for the millivolt differences you'd want to measure, so the differences would be small enough to be lost in the inaccuracies between boards. Also, many PSUs lack a method to query their measurements.
Fascinating study and excellent writeup. I have zero ML background but i really enjoyed this.
Great post; glad to see something written from beginning to end. One thing that surprised me, though, is that there wasn't a single word spent on feature selection (by looking at feature importance for correct predictions) or on feature engineering by combining features or otherwise applying domain knowledge. So, for example, trying to figure out if you can put different attributes in relation to each other to get a better feature (e.g. number-of-spinups / operating-time-in-hours). Personally that would be the first thing I'd try, and coincidentally it's also the most fun aspect of machine learning for me ;)
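For example, something along these lines; the file and column names are hypothetical, but this is the flavor of derived feature I mean.

    # Tiny sketch of derived "rate" features instead of raw lifetime counters.
    # File and column names are hypothetical.
    import pandas as pd

    df = pd.read_csv("smart_daily.csv")
    eps = 1e-9  # avoid division by zero on brand-new drives

    df["spinups_per_hour"] = df["start_stop_count"] / (df["power_on_hours"] + eps)
    df["realloc_per_hour"] = df["reallocated_sector_ct"] / (df["power_on_hours"] + eps)
    df["load_cycles_per_hour"] = df["load_cycle_count"] / (df["power_on_hours"] + eps)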
I think it’s not actually a very good ML case study for this reason. Feature engineering would be a huge part of a problem like this. Additionally, jumping to start with XGBoost is a pretty amateurish thing to do, and the very first problem to attack is class imbalance.
In the bio blurb they self-describe as an infra engineer who also enjoys data science. In some sense I really don’t like to see that. The quality of the data science / ML work in this is actually quite bad, but people use these blog posts as resume padders to try to jump into ML jobs without ever having any real experience or training.
I think it's a bad thing because it devalues the importance of real statistical computing skills, which are very hard to develop and take many years of education and experience - absolutely not the sort of thing you can get by dabbling in some Python packages on the weekend to do a little project like this.
The amount of waste I see from companies trying to avoid paying higher wages and avoid team structures that facilitate productivity of statistics experts is staggering - with all kinds of hacked up scripts and notebooks stitched together without proper backing statistical understanding, making ML engineers manage their own devops, and just ignoring base statistical questions.
For this drive problem for example, I expect to see a progression from simple models, each of which should address class imbalance as a first order concern. I expect to see how Bayesian modeling can help and how simple lifetime survivorship models can help. I expect to see a lot of feature engineering.
Instead I see an infra engineer playing around with data and trying one off the shelf framework, then claiming the whole premise can’t work in production.
You would probably spend a lot more time and money and arrive at the same conclusion. You can have the best background in statistics and all the experience in the world, but it doesn’t matter if you don’t have the right data. This post just kind of confirms what many people have already shown about SMART data: it’s not predictive of drive failure.
That sounds like an overly dismissive attitude. It doesn’t make much sense to say, “we tried this intern-grade strawman approach with huge flaws, and since it didn’t work, it would just be a waste of money to try a more principled solution.”
This article is the example of wasting time and money. It’s amazing to me the way anti-machine-learning sentiment causes people to do a complete 180 from common sense just to avoid actually investing minimal levels of resource or effort to study and understand problems where statistics can help.
A big observation in my career is that statistics makes non-statisticians go crazy in the head. People panic that statistics will be used to usurp their authority and then try to steamroll statistics with politics about what is or isn’t over-hyped and what would or wouldn’t be a “waste” of time or resources to try out.
"Only one drive manufacturer had enough failed drives in the dataset to produce a large enough subset, which we’ll refer to as Manufacturer A. Because this data treatment was so effective, and because most of the drives that had been in our fleet long enough to begin failing were already from Manufacturer A, we decided to reduce the scope of smarterCTL to only their drives."
I'd have dropped Manufacturer A from the supplier list and used the A-only model for the remainder of their drives. Then I'd have another go with smarterCTL for "not A".
Maybe training with an actual bottom-line utility function would have shown some use case; that is, instead of optimizing MCC, you would have predicted whether sending a person to change the drive now, before it fails, would cost less than letting it fail and cleaning up afterwards.
But if the drives are part of a redundant array, it would almost always be cheapest to let them fail... and if a large part of the failures are asymptomatic, you need the array anyway for the critical stuff, so I suppose it's a useless exercise.
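Even without changing the training objective, you could at least pick the alert threshold from the costs rather than from MCC. A toy sketch with fake scores and made-up costs:

    # Toy sketch: pick the alert threshold that minimizes expected cost instead of
    # optimizing MCC. Scores, labels, and costs below are all made up.
    import numpy as np

    rng = np.random.default_rng(0)
    y_true = rng.binomial(1, 0.02, 10_000)                            # ~2% failures
    y_prob = np.clip(0.4 * y_true + 0.6 * rng.random(10_000), 0, 1)   # fake model scores

    COST_SWAP = 25.0     # proactive replacement
    COST_FAILURE = 40.0  # cleanup after a surprise failure

    def expected_cost(threshold):
        flagged = y_prob >= threshold
        swaps = flagged.sum()                      # every flag triggers a swap
        missed = np.sum(~flagged & (y_true == 1))  # failures we let happen
        return swaps * COST_SWAP + missed * COST_FAILURE

    thresholds = np.linspace(0.05, 0.95, 19)
    best = min(thresholds, key=expected_cost)
    print(f"cheapest threshold: {best:.2f}, expected cost: {expected_cost(best):.0f}")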
They sample the SMART metrics once a day. Would more frequent samples provide a better signal?
For example, perhaps the read errors number increases about once a week. But in the week before a failure, it increases once a day. Or in the day before a failure, it increases once an hour.
OS-level metrics could also be predictive: write latency, i/o queue length, communication errors, and checksum errors in read data.
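If the daily data is already being collected, the rate-of-change idea is cheap to prototype; a sketch with hypothetical file and column names:

    # Sketch of rate-of-change features from daily SMART samples.
    # File and column names are hypothetical.
    import pandas as pd

    df = pd.read_csv("smart_daily.csv", parse_dates=["date"])
    df = df.sort_values(["serial_number", "date"])

    grp = df.groupby("serial_number")["read_error_count"]
    df["read_err_delta_1d"] = grp.diff(1)  # change since yesterday
    df["read_err_delta_7d"] = grp.diff(7)  # change over the last week
    # a drive whose daily delta starts approaching its usual weekly delta is the
    # "increases once a day instead of once a week" pattern described above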
It would be interesting to empirically identify the state space that leads to increased probability of failure and then seek to understand the rationale for this failure through physics. Ultimately, the ML model is picking up on some combination of factors that exposes physical weaknesses of the device.
My understanding is that SMART is not effective because it only catches (some) mechanical problems and that about half of the time it is the logic board that went bad.
You can filter out some of the bad drives ahead of time, but you will still get blindsided much of the time.
Assuming you don’t have an early-failure drive, SMART has one usable stat, “Power On Hours”. When that begins to creep towards 40,000, you will have a failed drive sooner rather than later.
As for the rest of the counters, raw read error rate, reallocated sector count, and friends are also good indications that your drive might not be in the best shape. They don’t say anything about when it will fail, though. I’ve had a drive rack up 20 reallocated sectors in a couple of weeks, only to stay at 20 for 4 years after that.
And load cycle count is only good for telling if you’re within manufacturer parameters. I frequently see old drives with Load Cycle Count 10-25 times the rated value, and they work just “fine”.
1. Extreme imbalance and survivorship bias. The dataset is unwieldy because of the imbalance, no matter the algorithm; and too small if you balance it.
2. I would take a daily/weekly/monthly diff between the different values and the time to failure, and obviously only from drives that ACTUALLY failed. All drives fail, but data from healthy drives is useless, as it lacks the target value. In practice you want to predict how close they are to failing and act preemptively; tagging them as OK and KO completely misses the target (hehe), IMO.
3. Drive maker and model should be vectorized and fed into the model. Mandatory. Different manufacturers and models show different SMART behaviors, and while some stats are red flags in certain drives, they mean nothing in others.
4. This is more of a personal preference, but I like random forests to test whether a dataset is useful. They're not perfect, but they give reasonably good results for most tasks and tend not to overfit, so you can use them as an overfitting benchmark versus other algorithms (see the sketch below). If they don't work, either the dataset is shit, or it needs some transformation, or the model needs to be more complex (but that's easy to see). Obviously this applies to classifiers and regressions, not to many other ML tasks.
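For what it's worth, the sketch I mean by point 4; the file and column names are hypothetical, and MCC is used only because that's what the article reports.

    # Minimal sketch of that sanity-check baseline: a class-weighted random forest
    # on a (hypothetical) per-drive SMART feature table, scored with MCC.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import matthews_corrcoef

    df = pd.read_csv("smart_features.csv")  # assumed: one row per drive, binary 'failed' label
    X = df.drop(columns=["failed", "serial_number"])
    y = df["failed"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    rf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
    rf.fit(X_tr, y_tr)
    print("MCC:", matthews_corrcoef(y_te, rf.predict(X_te)))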
Neat ML exercise, though it wasn't the right solution for the problem.
What is the right solution at the consumer level, though? I currently run smartmontools at regular intervals, compare the TBW with the manufacturer's rated TBW, and send an MQTT message using a script.