
What SMART Stats Tell Us About Hard Drives - ingve
https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
======
esaym
Ten years ago, when I was trying to learn how to "program", I wrote this bash
script (to be added into /etc/cron.daily) that dumps a few SMART stats that
are normally 0 or slow-changing, diffs them with the copy from the previous
run, and if anything is different (and cron is configured right) it will email
you the diff. Every Linux machine I touch gets this file dropped onto it. I've
replaced many hard drives because of it.

    
    
        #!/bin/bash
        
        # Grab the attributes that should be zero or nearly static on a healthy drive.
        smartctl -a /dev/sda > /root/smartStates
        grep Reallocated_Sector_Ct /root/smartStates > /root/stats
        grep Current_Pending_Sector /root/smartStates >> /root/stats
        grep Offline_Uncorrectable /root/smartStates >> /root/stats
        grep UDMA_CRC_Error_Count /root/smartStates >> /root/stats
        
        # Compare with the previous run; cmp exits 0 (same), 1 (different), >1 (error).
        touch /root/statsOld
        cmp /root/stats /root/statsOld
        result=$?
        
        if [[ $result -ne "1" && $result -ne "0" ]]
          then
                echo "Something went wrong"
                exit -1
        fi
        
        # Anything printed to stdout gets mailed by cron.
        if [[ $result -eq "1" ]]
          then
                echo "Files are different"
                cat /root/stats
        fi
        
        mv /root/stats /root/statsOld
        rm /root/smartStates
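
To install it, drop it into /etc/cron.daily (the filename is up to you, but
note that Debian-family run-parts skips names containing a dot, so no ".sh"
suffix):

        install -m 0755 smartdiff /etc/cron.daily/smartdiff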

~~~
koolba
(Warning: pedantic script review)

Add a "set -e" to catch errors, say if the disk can't be read or a file can't
be written.

Why reuse the same temp files? Make fresh ones with mktemp and auto-clean them
via an exit trap. As written, this isn't safe to run concurrently.

Exiting -1 on error? Don't use negatives; exit codes are eight unsigned bits,
so -1 just wraps around to 255.

Wrap it all in a main() function and use locals instead of global vars.
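
Roughly, an untested sketch of the above (reusing esaym's counter list; the
paths are placeholders):

        #!/bin/bash
        # -u and pipefail go beyond plain -e, but they're cheap insurance
        set -euo pipefail
        
        main() {
            local dump stats old=/root/statsOld
            # per-run temp files; the trap cleans them up however we exit
            dump=$(mktemp)
            stats=$(mktemp)
            trap 'rm -f "$dump" "$stats"' EXIT
        
            # smartctl's exit status is a bitmask and can be non-zero even when
            # the read worked, hence the || true under set -e
            smartctl -a /dev/sda > "$dump" || true
            grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count' \
                "$dump" > "$stats"
        
            touch "$old"
            if ! cmp -s "$stats" "$old"; then
                echo "Files are different"
                cat "$stats"
            fi
            mv "$stats" "$old"
        }
        
        main "$@"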

~~~
jimmaswell
It's just a small bash script, and one that's apparently worked well for 10
years. Rewriting it to J2EE standards would just be a waste of time; the best
outcome is that it still works the same, and the other outcome is that you
introduced a new bug refactoring it.

~~~
koolba
> It's just a small bash script, and one that's apparently worked well for 10
> years. Rewriting it to J2EE standards would just be a waste of time; the
> best outcome is that it still works the same, and the other outcome is that
> you introduced a new bug refactoring it.

I don't see how following best practices for bash scripting (or really shell
scripting in general) can be compared to J2EE standards.

Seek quality in all your scripting so it becomes the norm. Otherwise you'll
end up with crap like that in something mission-critical.

------
mjb
SMART is a fantastic exercise in sensitivity and specificity. As backblaze is
showing with this data, SMART stats have poor sensitivity, but what's much
worse for those who run big fleets of drives is their poor specificity. Lots
of healthy drives are reported unhealthy by SMART. If I'm running a gold-
plated database server, that doesn't matter. A couple of extra planned drive
replacements is a small price to pay for avoiding unplanned failures. If I'm
running a huge drive cluster, it's much, much more expensive.

Take Backblaze's 0.01% for a group of four failures. That's replacing an extra
100 drives per million, at random, and only getting the benefit of correctly
predicting failures 10.4% of the time.
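
In numbers (the million-drive fleet is just for scale):

        # 0.01% of healthy drives flagged by the four-stat group, per million drives
        awk 'BEGIN { print 1000000 * 0.0001 }'    # -> 100 extra pulls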

This is great data to have.

~~~
Retric
Thresholds are often useful with these kinds of stats; a drive remapping one
sector might mean nothing, but remapping 30 in a week could be a great
predictor. Further, they only have ~70k drives across a range of product
lines, so what predicts drive X failing very well might say little about
drive Y.
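
Something like this bolted onto esaym's script above would do it (the
threshold and state file are arbitrary):

        # alert only when the raw count jumps by more than $max between runs
        max=30
        new=$(smartctl -A /dev/sda | awk '/Reallocated_Sector_Ct/ {print $10}')
        old=$(cat /root/reallocOld 2>/dev/null || echo "$new")
        if (( new - old > max )); then
            echo "Reallocated sectors jumped: $old -> $new"
        fi
        echo "$new" > /root/reallocOld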

PS: Remember that all RAM gets bit flip errors over time. That's one of the
reasons rebooting is often so useful, but it also means one-off errors are
often meaningless.

~~~
sn
ECC RAM will either correct the error or raise a non-maskable interrupt if it
can't be corrected.

~~~
Retric
ECC vastly lowers, but does not remove, this problem. You can, for example,
get bit flip errors while computing the checksum.

As a personal user it's a non-issue, but scale things to ~70k devices and you
get ~1.7 million device-hours per day.

~~~
sn
Fair enough. Your original comment did not make this distinction.

I dug around and found
[https://www.fiala.me/pubs/papers/sc12-redmpi.pdf](https://www.fiala.me/pubs/papers/sc12-redmpi.pdf),
the title of which is "Detection and Correction of Silent Data Corruption for
Large-Scale High-Performance Computing."

They state that for a Cray there was a double bit flip about once a day across
75k modules. To (probably incorrectly) extrapolate: if you have a server with
16 modules, that would be equivalent to a single double-bit failure about once
every 13 years.
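
Back-of-envelope for that extrapolation:

        # 1 double-bit flip/day across 75k modules, scaled down to a 16-module box
        awk 'BEGIN { printf "once every %.1f years\n", (75000 / 16) / 365.25 }'
        # -> once every 12.8 years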

------
latitude
Might be a good time to plug my little baby -
[https://diskovery.io](https://diskovery.io)

If you want to have a quick, but in-depth look at your drives, it'll give you
lots of data, including the SMART table interpreted in a vendor-specific way.
It also understands some RAID setups, and more support for this is upcoming.
Windows only, at the moment.

To give a bit of context: SMART data comprises a set of attributes, and each
attribute has a value, a threshold, and a raw value. Values are opaque 8-bit
somethings that are only meant to be compared against thresholds; when a
value falls below its threshold, that may indicate a problem. They aren't
really interesting. What's interesting is the "raw" values, but, as the name
implies, they are vendor-specific and require decoding. Some vendors publish
the specs, but most don't, and specs that _are_ published are often
incomplete or plain wrong. So there's a LOT of reverse engineering and
guesswork involved, which makes writing a SMART tool both frustrating and
interesting at the same time. But if you need just a "dying / healthy"
indicator, it's a very easy thing to extract from a drive.
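
On Linux you can eyeball that value-vs-threshold comparison with plain
smartctl (rough sketch; the columns assume the usual -A table layout):

        # print attributes whose normalized VALUE is at or below THRESH
        smartctl -A /dev/sda | awk '$1 ~ /^[0-9]+$/ && $6+0 > 0 && $4+0 <= $6+0 {
            print $2, "value", $4, "is at/below threshold", $6
        }'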

~~~
djsumdog
Has anyone ported your work to Linux or macOS? I guess not, since it looks
like it would be very OS-specific. It looks like an incredible tool.

~~~
latitude
It is indeed pretty OS-specific.

Not the SMART part, but how you talk to the drives and controllers, and how
storage is generally sliced into partitions, volumes, etc. Windows has a
fairly comprehensive version of software RAID, but in true Microsoft fashion
they do things ass-backwards in more than one place. For example, striped
volumes (RAID 0) will use only a part of a partition for each stripe, but to
learn that you have to talk to the Virtual Disk Service rather than the
regular Disk/Volume management API. This is, basically, as unportable as it
gets.

------
daveguy
Wow. They provide all of the raw log data from the drives[0]. Looks like an
interesting source of data for a Kaggle competition.

[0] [https://www.backblaze.com/b2/hard-drive-test-data.html](https://www.backblaze.com/b2/hard-drive-test-data.html)

------
tedunangst
Isn't the reverse stat more interesting? What percentage of drives reporting
an error fail within X weeks? I don't want to know how many failed drives had
errors, I want to know how many errored drives fail. (A more accurate title
might be "What failed drives tell us about SMART stats".)

------
arielhn
A small nitpick here, but the moment a site pops up a modal dialog asking me
to enter my email for some kind of subscription, I simply close the tab. I've
been doing this for the past three months with any unfamiliar website I
visit.

Such a nuisance for what might be a good read.

~~~
teh_klev
Shame... you could've just dismissed the popup and not missed out on an
interesting article; it's the same energy expended, but with a net gain
instead of a loss. A small price to pay for Backblaze being willing to share
interesting stuff like this, and it's hardly the most egregious example of
this type of thing. Also, these types of complaints have been done to death
here over the years and are really, really tedious. Please complain to
Backblaze instead of trying to take this thread off-topic.

~~~
arielhn
First of all, I was on public transport when I clicked that link, so my
'consuming' experience was already not optimal from a readership point of
view. Many technical people, me included, are busy people with short
tolerance for things that detract from what they're supposed to read or
comprehend. Unless I can just read right there, right away, I'm going to skip
to the next tab.

Secondly, I've noticed this is a trend right now: you get to a page, and
after a few seconds a dialog is thrown into your face with little regard for
the fact that you (the reader) are trying to concentrate on the content. To
me that is rude. You wouldn't go to a bookstore and, while reading the table
of contents, have a salesman grab the book from you and ask "would you like
to give me your email address so that we can notify you when we have new
books available?" without wondering what kind of establishment allows that
kind of behavior.

Third, I got the link from HN, so it was easier for me to come back to this
tab, log in, and hit reply than to register a Disqus account and enter a
comment there.

With that said, I don't want to _blog_ about this on Medium or whatever; I
don't need clicks from moaning about every little thing. This is my way of
protesting what I perceive is happening right now, and that's why I started
with "small nitpick".

------
d3lxa
As a data scientist, I would be curious to see machine learning applied to
this problem. I'd start with naive Bayes, logistic regression, and SVMs.

@backblaze: I'm pretty sure you could automate a large portion of your
investigation that way.

------
StillBored
First, these counters vary in meaning and support by vendor/model, and it
would be nice if someone came along and mandated further standardized ones.
Instead you have to tune everything for each drive model. In this regard SCSI
is a little better (more on that later).

Second, timeouts and uncorrectable errors are generally reported to the
controller as part of normal operation, so having SMART track them is just a
bonus. Either of those two conditions is usually sufficient to kick a drive
out of a functional RAID array, because those are data-loss events. Most
drives have layers and layers of ECC, so to get an uncorrectable error a lot
of bits need to be flipped in the target sector. For that to happen, there is
likely something mechanical going on, which is likely to affect adjacent
tracks/sectors too. Of course, if you never scrub your drives, it's possible
for bitrot to accumulate on a perfectly functional device until sectors
aren't recoverable.

In my previous life I found it much more interesting to track the rate of
soft error counts during scrub operations, particularly in larger arrays,
because sometimes a drive would start getting slower (frequently caused by
read retries in the drive itself, or problems tracking the embedded servo,
etc.) and the correctable error counts would start to rise steadily, followed
by actual timeouts/uncorrectable errors. Of course these days it seems most
drives won't show the correctable error counts, because it would freak people
out; instead you have to infer it from seek errors and relocated sector
counts. Although it might now be considered a SAS/SATA differentiator: SCSI
has standardized log pages with more detailed information (random google hit:
[http://www.seagate.com/staticfiles/support/disc/manuals/scsi...](http://www.seagate.com/staticfiles/support/disc/manuals/scsi/100293068a.pdf),
page 238). Note the errors are categorized as corrected without delay,
corrected with substantial delay, and corrected on a retry. By comparison,
the SMART data isn't particularly "smart".
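
If you want to poke at those log pages yourself on Linux, sg3_utils and
smartmontools can both read them (device names are examples):

        # read error counter log page (0x03) on a SAS drive
        sg_logs --page=3 /dev/sg0
        # smartmontools summarizes the same corrected/uncorrected counters
        smartctl -l error /dev/sda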

------
gerosa
Google published a study in 2007 which stated: "SMART models are more useful
in predicting trends for large aggregate populations than for individual
components."

[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf)

------
joenathan
This is good info to know; it helps me as a sysadmin to be confident in
making decisions for my customers and their data. I regularly use a tool
called CrystalDiskInfo to check the SMART stats of drives. I'll pay more
attention to the raw values in the future.

~~~
takeda
It's interesting that most people rely on the raw values, since the standard
does not require them to be meaningful, and depending on the vendor they
could be anything.

I suspect this is because the value, worst, and threshold columns are kind of
confusing to understand.

~~~
toast0
There aren't too many vendors of spinning disks, and if you have a lot of
disks it doesn't take long to see that the sector-count metrics correspond to
sectors. In my experience, bad sector count is a good predictor of future
trouble: back when we ran disks until they threw read errors (before we had
SMART monitoring), they all had lots of bad sectors by then. That said,
there's a threshold; getting to 100 slowly is probably OK, a thousand is
probably not.

SSDs though, they just disappear from the bus when they fail; so I haven't
been able to look at a dead one and see what looks like a useful predictor. I
have seen some SSDs reallocating a big block, which kills performance while
it's going on...

~~~
cuchulain
"SSDs though, they just disappear from the bus when they fail"

This isn't always true, and actually shouldn't ever be true - it's a
particular failure mode you're seeing, and while it appears to be one common
across a number of SSD controllers, it's still a pretty sorry fact that it
happens.

All SSDs (at least all not-complete-rubbish ones) report some kind of
flash/media wearout indicator via SMART, which isn't necessarily an imminent
failure indicator (SSDs will generally continue to work long past the
technical wearout point), but is a very strong indicator that you should
replace it soon and should probably buy a better one next time.

SSDs do suffer from sector reallocations in the normal way, and the same kind
of metric monitoring can be done. It's pretty vendor-specific as to what SMART
attributes they report, but attributes like available reserved space, total
flash writes, flash erase and flash write failure counts and so on are pretty
common.

~~~
toast0
With thousands of SATA SSDs, I've seen one fail in the traditional fashion
(some sectors weren't readable, otherwise mostly fine); the rest of the maybe
hundred that failed just disappeared from the bus. I don't monitor the
wear-out indicators, but from occasional looks, we're never near a
significant fraction of the wear capacity. I'm very happy not to have any
more spinning disks in production, because the SSDs fail less often; it's
just that the failures are more annoying, because it's hard to have an
orderly shutdown when disks disappear.

~~~
rsync
Funny how ~18 years later I still have compact flash devices plugged into IDE
ports that have never failed. In fact, across a broad spectrum of applications
and installs, I have never seen a working CF device fail in the field.

SSDs on the other hand ...

I use SSDs for caching (ZFS read cache and mirrored SLOGs) and I use them for
mirrored boot devices in modern, production systems that should have a fast OS
device.

But if I want a system to run forever ... if I am optimizing for longevity ...
I use compact flash, even in 2016.

(yes, of course I set them to be read-only and disable swap)

------
nwmcsween
This data should be put into smartmontools or a separate tool to give a
simple good/bad rating, as the actual values are somewhat meaningless w/o
data like this to compare against.

~~~
bigiain
Hmmm, I wonder if a crowdsourced data collection would be useful/securable?

Would you enable an option for smartmontools that sent all your drive SMART
data to a cloud hosted db (with as little identifying information as possible)
and tell it when you had a drive fail - in return for that same service
alerting you with "best estimates" of your risks of drive failure?

------
caf
_Operational drives with one or more of our five SMART stats greater than zero
– 4.2%_

This doesn't gel with the earlier table, which shows that 4.8% of operational
drives have non-zero SMART 188 alone; a subset can't be larger than the
whole.

~~~
mjevans
The numbers are close enough that it sounds like the sets share a good
portion of common data, but that there are items exclusive to one or both
sets.

I could easily see one set being based on a different time window, or one
missing a category of drives (e.g. one also counts drives from testing /
non-production units).

------
Synroc
Interesting, thanks for posting! Could you talk quickly about why it's
interesting to predict drive failure? Is it to understand how many replacement
drives you might need to order in the short term, or is there value beyond
stock management of drives?

~~~
greglindahl
In a non-RAID context (for example, NoSQL databases that keep 3 copies of
chunks of data), knowing about a failure in advance means you can migrate off
that drive gradually, without it becoming an emergency.

~~~
honkhonkpants
If you have three replicas, who cares if one fails? Just wait for it to fail
and re-replicate from the survivors.

~~~
greglindahl
Because there's a risk that another replica will fail. And because you can do
the copying more slowly if the failure is predicted and not super-immediate.

------
mynameislegion
Someone should set up an AI startup for harddrive failure prediction.

------
castratikron
I have a tingling feeling that Bayes' theorem applied to these probabilities
could lead to some insights. Might do this later if I find the ambition.
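
The shape of it, with placeholder numbers (only the 4.2% figure appears
elsewhere in this thread; the other two inputs are invented for
illustration):

        # P(fail | flagged) = P(flagged | fail) * P(fail) / P(flagged)
        awk 'BEGIN {
            sens = 0.767    # P(flagged | fail): placeholder
            base = 0.02     # P(fail), annual: placeholder
            flag = 0.042    # P(flagged): operational drives with a non-zero stat
            printf "P(fail | flagged) ~= %.1f%%\n", 100 * sens * base / flag
        }'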

------
catscratch
> Perhaps one of our stat-geek readers will be able to tease out a conclusion
> regarding power cycles.

I think it's obvious: if the drive has errors, that may cause a reboot, or
create the need for one.

------
rasz_pl
Interestingly, they're not tracking 184 "End-to-End Error", which a couple of
SMART tools consider to be a failure of the drive.

