

Predicting Hard Drive Failures with SMART Stats - ingve
https://www.backblaze.com/blog/hard-drive-smart-stats/

======
level
This[1] paper from Google also covers how SMART stats relate to drive
failures. This line from the conclusion kind of wraps things up:

> We find, for example, that after their first scan error, drives are 39 times
> more likely to fail within 60 days than drives with no such errors.

So basically, if you are getting SMART errors, you should make sure that data
is backed up (if it isn't already).

Another interesting section is:

> Out of all failed drives, over 56% of them have no count in any of the four
> strong SMART signals, namely scan errors, reallocation count, offline
> reallocation, and probational count. In other words, models based only on
> those signals can never predict more than half of the failed drives.

So, while monitoring SMART errors is a good indicator of whether your drive is going
to fail, it's hardly failsafe, and chances are your drive will fail without any
notice. SMART stats are interesting, but not the ideal measure of the health
of a disk.

[1] [http://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf](http://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf)
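
For anyone who wants to watch those four "strong signal" counters on their own machines, a rough sketch using smartmontools' `smartctl -A` output is below. The mapping from the paper's terms to ATA attribute IDs (5, 196, 197, 198) is my own guess, so treat it as a starting point rather than a definitive list.

```python
# Rough sketch: flag a drive if any of the "strong signal" attributes are non-zero.
# The attribute IDs are my guess at mapping the paper's terms to common ATA SMART IDs
# (5 = Reallocated_Sector_Ct, 196 = Reallocated_Event_Count,
#  197 = Current_Pending_Sector, 198 = Offline_Uncorrectable) -- verify for your drives.
import subprocess

STRONG_SIGNALS = {5, 196, 197, 198}

def strong_signal_counts(device):
    """Return {attribute_id: raw_value} for the strong-signal attributes of one drive."""
    out = subprocess.run(
        ["smartctl", "-A", device],   # needs root and smartmontools installed
        capture_output=True, text=True,
    ).stdout
    counts = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute table rows start with the numeric attribute ID; the raw value is last.
        if fields and fields[0].isdigit() and int(fields[0]) in STRONG_SIGNALS:
            try:
                counts[int(fields[0])] = int(fields[-1])
            except ValueError:
                pass  # some vendors decorate raw values; ignore those here
    return counts

if __name__ == "__main__":
    counts = strong_signal_counts("/dev/sda")
    if any(v > 0 for v in counts.values()):
        print("WARNING: strong-signal SMART attributes are non-zero:", counts)
    else:
        print("No strong-signal errors reported:", counts)
```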

------
KaiserPro
What's interesting to me is that they have almost exactly 10 times the number of
drives we have in production (excluding system drives).

On average we get around 4 failures a week (some weeks more, others none). They
tend to fail in bunches. For us there is a correlation between the way they
are used and the failure rate.

One bunch of file servers was being used for particle sims to model black
holes. We pulled petabytes through the array. Within 2 months of the show
finishing we had replaced about 30% of the drives (RAID 6 with 14-disk LUNs,
4 hot spares).

Others have been happy and haven't killed as many drives, yet they've had the same
amount of data pulled through them. They are also identical: same drives, same
RAID controller, and bought very close together.

One thing to note is that if your work load changes, so will your failure
pattern.

~~~
gaadd33
Is it possible that the drives that died were in similar physical locations?
For example, if you have a dodgy power supply, perhaps the last bank of drives
gets lower-quality power; or, depending on the temperature of the case and the
airflow, perhaps the top bank of drives runs significantly warmer?

~~~
KaiserPro
That's certainly possible, although we've done our best to minimise it.

We have ColdLogik water-cooled racks, which almost eliminate hot spots. The
other thing to note is that 4000 drives fit into less than 7 racks (yup, I
was surprised too), so they are all in the same place.

We also have two transformers, just for us, with lots of sexy power smoothing
(but no UPS, and yes that's a bad thing.)

It could be the RAID enclosure itself; that might be part of the problem.
However, they _should_ be identical, with the same firmware.

~~~
beagle3
Could it have been something really stupid like a loose screw that would let
vibrations get amplified into dangerous territory?

------
nisa
I'm using a tool called HDSentinel for all Linux machines:
[http://www.hdsentinel.com/hard_disk_sentinel_linux.php](http://www.hdsentinel.com/hard_disk_sentinel_linux.php)
- it's a small static binary (closed source, though) that reads the SMART values and
calculates a health percentage from them:
[http://www.hdsentinel.com/help/en/52_cond.html](http://www.hdsentinel.com/help/en/52_cond.html)

I don't have detailed statistics but it's quite reliable in predicting disk
health. All drives I had to change because even ZFS couldn't cope with read
errors anymore had a health below 10%.

I'm using this for Nagios checks instead of other SMART checks, as it's more
reliable, and the health reporting is all you need to decide when to inspect the
drive further. I never got reliable performance values out of the drives we use,
though. But that's a SMART problem with certain vendors.

I have nothing to do with them; I just think it's a great tool that lets you
watch the values named in the article quite elegantly.
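
If it helps anyone, a Nagios-style wrapper around it only takes a few lines. This is a sketch from memory: I'm assuming the binary is on the PATH as `hdsentinel` and that its report prints a "Health : NN %" line per device, so check your version's output and adjust the regex; the thresholds are just examples.

```python
#!/usr/bin/env python3
# Sketch of a Nagios-style check wrapping the HDSentinel Linux binary.
# Assumes per-device "Health : NN %" lines in the report (verify for your version);
# the thresholds below are arbitrary examples, not recommendations.
import re
import subprocess
import sys

WARN_BELOW = 50
CRIT_BELOW = 10   # the failed drives mentioned above were below 10%

try:
    report = subprocess.run(["hdsentinel"], capture_output=True, text=True).stdout
except FileNotFoundError:
    print("UNKNOWN: hdsentinel binary not found")
    sys.exit(3)

healths = [int(h) for h in re.findall(r"Health\s*:\s*(\d+)\s*%", report)]
if not healths:
    print("UNKNOWN: no health values parsed from hdsentinel output")
    sys.exit(3)

worst = min(healths)
if worst < CRIT_BELOW:
    print(f"CRITICAL: worst drive health {worst}%")
    sys.exit(2)
elif worst < WARN_BELOW:
    print(f"WARNING: worst drive health {worst}%")
    sys.exit(1)
print(f"OK: worst drive health {worst}%")
sys.exit(0)
```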

------
gvb
The article is interesting, but buried in it is this link to their full stats
on the SMART reports, which are fascinating:

[https://www.backblaze.com/blog-smart-stats-2014-8.html](https://www.backblaze.com/blog-smart-stats-2014-8.html)

------
userbinator
I'm not surprised to see reallocated sector count in there - usually once a
reallocated sector appears, more will come soon, so my policy is to replace, or
relegate to "unimportant" use (e.g. temporary machines/data transfer), any drives
that have more than 0 of them.

Drives are also very sensitive to vibration and shock, both when powered and
unpowered - I've had experiences with computer cases that mysteriously
"killed" drives (errors showing up within weeks to months of use), and traced
it down to a bad design that caused resonance in the drive cage.

------
gburt
A regression fit on _failure ~ 1 + smart_ would've presented this data
nicely and shown how the various SMART stats interact. If our goal is just to
forecast, fitting that regression with a penalized method would likely have
provided a great result.
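
Something like the sketch below, say. It assumes you've already built a per-drive table with a binary `failed` column and one column per SMART raw value (not exactly the shape Backblaze publish the data in); the file name and column names are placeholders.

```python
# Sketch of the penalized "failure ~ 1 + smart" idea with scikit-learn.
# Assumes a per-drive snapshot table with a binary "failed" column and
# smart_* feature columns -- that preprocessing is not shown here.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("drive_snapshots.csv")      # hypothetical file
X = df.filter(like="smart_")                 # e.g. smart_5_raw, smart_187_raw, ...
y = df["failed"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The L1 penalty does the attribute selection; the intercept is the "1" term.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X_train, y_train)

print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
coefs = model.named_steps["logisticregression"].coef_[0]
for name, coef in sorted(zip(X.columns, coefs), key=lambda t: -abs(t[1])):
    if coef != 0.0:
        print(f"{name}: {coef:+.3f}")
```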

~~~
gburt
As I received an email about this comment the other day, I wanted to add: this
is certainly not the only (or even close to the best) model you could've fit;
my point was only to spawn discussion here. There are a tremendous number of
ways this data could've been presented that would communicate more than the way
it was.

------
Shank
Would it be possible to collect better data by correlating SMART data per
drive manufacturer? Sure, it'd be a smaller sample size, but if all Western
Digitals start to indicate failure in the same way (a pattern that would be
obscured by having other manufacturers' data blended in), it might be worth the
effort.

------
rcarmo
Anyone know of anything comparable for SSDs?

~~~
helper
The Tech Report's SSD endurance series covers which SMART attributes are
relevant for which manufacturers [1].

When we were getting ready to move some production systems over to SSDs we
took one drive and wrote random data to it until it died. After every complete
write cycle we collected the SMART attributes so we would know what a dying
drive looked like.

[1]: [http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte](http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte)
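
For anyone wanting to repeat that exercise, the loop is roughly the sketch below. The device path is a placeholder, the run destroys everything on that device, it needs root, and the SMART logging leans on smartmontools' `smartctl`.

```python
# Sketch of "write random data until it dies, logging SMART after every full write cycle".
# DEVICE is a placeholder -- point it at the sacrificial drive; this wipes it.
import os
import subprocess
import time

DEVICE = "/dev/sdX"              # placeholder
CHUNK = 4 * 1024 * 1024          # 4 MiB per write
LOG = "smart_log.txt"

def device_size(dev):
    with open(dev, "rb") as f:
        return f.seek(0, os.SEEK_END)

def log_smart(cycle):
    out = subprocess.run(["smartctl", "-A", DEVICE],
                         capture_output=True, text=True).stdout
    with open(LOG, "a") as f:
        f.write(f"=== cycle {cycle} at {time.ctime()} ===\n{out}\n")

size = device_size(DEVICE)
cycle = 0
while True:                      # ends when the drive stops accepting writes (OSError)
    with open(DEVICE, "wb") as dev:
        written = 0
        while written < size:
            dev.write(os.urandom(min(CHUNK, size - written)))
            written += CHUNK
    cycle += 1
    log_smart(cycle)
```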

------
amelius
Interesting. It appears they have done quite a thorough analysis of the data.
But of course there are standard techniques for statistical classification
from the mathematical community. It would be nice if somebody wrote a
classifier for this problem.

------
chiph
> once SMART 187 goes above 0, we schedule the drive for replacement.

This agrees with my personal experience. Once you get any reported errors at
all, it's time to plan the drive's funeral.

------
herf
Sounds like there should be a single metric based on these stats (~=
multiplier for failure risk?)
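
The crude version of that is to treat each flagged attribute as an independent risk multiplier and multiply them together, which is roughly what the coefficients of a logistic regression give you anyway. A toy sketch, where the per-attribute multipliers are made-up placeholders rather than measured values:

```python
# Toy single-score sketch: combine per-attribute failure-risk multipliers,
# naively assuming independence. The numbers below are placeholders, NOT
# measured values -- they would have to be fitted from real failure data.
RISK_MULTIPLIERS = {
    "smart_5_raw": 15.0,      # reallocated sectors (placeholder)
    "smart_187_raw": 20.0,    # reported uncorrectable errors (placeholder)
    "smart_197_raw": 16.0,    # current pending sectors (placeholder)
}

def risk_multiplier(attributes):
    """attributes: {name: raw_value}; returns combined risk vs. a clean drive."""
    score = 1.0
    for name, value in attributes.items():
        if value > 0 and name in RISK_MULTIPLIERS:
            score *= RISK_MULTIPLIERS[name]
    return score

# Example: non-zero reallocated and pending sectors -> 15 * 16 = 240x (with placeholders).
print(risk_multiplier({"smart_5_raw": 3, "smart_187_raw": 0, "smart_197_raw": 1}))
```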

