
Investigation: Is Your SSD More Reliable Than A Hard Drive? - bane
http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923.html
======
AngryParsley
One thing the article didn't address is performance over time. Even with TRIM
support, SSDs get slower as they're used. Occasionally, cells die prematurely,
causing spare capacity to decrease. This won't affect read speeds much, but it
will hurt writes. These older SSDs will still be faster than hard drives but
they won't be as fast as you'd expect them to be.

Now for some of my own data. Here's an Intel X25-M G2 after a lot of usage:
[http://abughrai.be/pics/ssd_erase/Screenshot-160%20GB%20Soli...](http://abughrai.be/pics/ssd_erase/Screenshot-160%20GB%20Solid-State%20Disk%20\(ATA%20INTEL%20SSDSA2M160G2GC\)%20%E2%80%93%20Benchmark.png)
and here it is after an ATA secure erase:
[http://abughrai.be/pics/ssd_erase/Screenshot-160%20GB%20Soli...](http://abughrai.be/pics/ssd_erase/Screenshot-160%20GB%20Solid-State%20Disk%20\(ATA%20INTEL%20SSDSA2M160G2GC\)%20%E2%80%93%20Benchmark-1.png)

The G2 has TRIM support and this drive was used on an ext4 filesystem with
TRIM enabled. After the erase, performance was almost back to that of a
pristine drive. There was a 6GB swap partition on the drive as well as the
ext4 partition. I'm pretty sure swap partitions aren't trimmed, so that could
have been the reason for the performance degradation.

~~~
caf
The kernel should really issue a large TRIM for most of the swap partition at
swapon time.

~~~
caf
...and in fact, it does so:
<http://lxr.linux.no/#linux+v3.0/mm/swapfile.c#L2117> (and has done so since
kernel 2.6.29).

Additionally, if you pass the SWAP_FLAG_DISCARD flag to swapon(2), it'll issue
TRIMs for swap space as it is freed at runtime. You'll need a pretty recent
/sbin/swapon to support that flag, though.

------
jrockway
I don't know how the data works out for this, but I imagine that SSDs in
laptops fail a lot less often than spinners do. "Enterprise use" is the
perfect place for rotating storage: the machines aren't dropped off desks,
they aren't power-cycled very often, and they aren't stuffed into backpacks
while turned on (nearly causing a fire). SSDs don't care much about these
things, while spinning disks do. So I imagine that if you have an SSD in your
laptop, you are less likely to experience drive failure.

(Another nice thing: those new micro-SATA SSDs are small enough that you can
conceivably RAID-1 them in a tiny laptop!)

------
bengl3rt
At the moment, the reliability problem with SSDs stems entirely from firmware
bugs, rather than the underlying flash technology. All the issues you hear
about with regard to drives causing blue screens, or simply failing to be
recognized by the system at all after a while, are issues with the firmware on
the controller chip - the actual flash chips themselves are pretty dumb and
rarely fail catastrophically.

This will get better with time, as SSD firmware accumulates the kind of
runtime (hours × units in use) over years that HDD firmware has had.

~~~
kabdib
It would be nice to see the SSD manufacturers get a clue about recovering
data. Having the device fail catastrophically is nuts; it means that all of
the carefully designed recovery schemes in the file system are basically
worthless.

Yuck.

~~~
bengl3rt
Agreed. Right now it is a performance arms race - IOPS trumps everything.

However, bigger and bigger enterprise vendors are doing rigorous qualification
tests on SSDs for white-labeling purposes and demanding that the firmware be
bulletproof. Admittedly, the enterprise workload is a bit different from a
laptop's (servers never go to sleep), but they are the ones accumulating the
most runtime, given 24-hour uptimes.

------
HaloZero
The short summary is that nobody has used SSDs at any large scale for more
than 2 years, and the failure rates for SSDs over that period are similar to
those of regular hard drives.

Also apparently OCZ SSDs aren't as good as Intel SSDs (based on customer
return rates).

~~~
mrb
I get annoyed when people make generic claims that SSDs from vendor X are not
as reliable as vendor Y.

The truth is that each vendor manufactures multiple lines of SSDs, each
running different major firmware versions, with different hardware revisions.
So reliability variance between models from a single vendor is often _much
more significant_ than variance between different vendors.

As for OCZ, which model do you have experience with? Vertex 2? 3? Agility?
The enterprise-class Talos? What about the Intel ones - were they 50nm
NAND-based SSDs? 34nm? You need to share some of these details. Claiming that
Intel > OCZ is meaningless without them.

~~~
biot
The article makes those claims, based mostly on anecdotal evidence rather than
a peer-reviewed, scientifically rigorous study. It concludes with: "Giving
credit where it is due, many of the IT managers we interviewed reiterated that
Intel's SLC-based SSDs are the shining standard by which others are measured."

While an itemized breakdown by datacenter isn't available, one datacenter
(InterServer.net, with fewer than 100 SSDs) says: "Intel SSD's are night and
day in failure rates when it comes to some other drives. For example the
SuperTalent SSD drives have had an extremely high failure rate including model
FTM32GL25H, FTM32G225H, and FTM32GX25H. I estimate about two-thirds of these
drives have failed since being put into service. With these failures however,
the drives were not recoverable at all. They generally disappeared completely,
no longer being readable. Spinners die much more gracefully with an easier
disk recovery. I cannot compare this to the Intel's SSDs yet since I have not
experienced any failures." They apparently use Intel's X25-E
(SSDSA2SH032G1GN).

~~~
moe
_They generally disappeared completely, no longer being readable. Spinners die
much more gracefully with an easier disk recovery._

That's a weird thing to say. In a datacenter you _want_ your disks to be
fail-fast, i.e. you _want_ them to drop out of the array early and cleanly.

Nobody "recovers" hard drives in a datacenter; you just swap them.

The worst failure mode (still seen way too commonly) is the "lame drive" that
grows bad sectors, predictive failures and all that jazz over time, causing
bus resets and confused controllers, and bringing the whole array to a crawl
until a human steps in.

That said, my experience with SSDs in that regard isn't too good either. I've
seen them fail cleanly, but I've also seen a single failing Vertex lock up an
LSI controller hard...

------
r00fus
Consider this: the hard drive is the least reliable component of almost any
modern PC/laptop. Add to this that it likely contains the most valuable non-
commodity asset: your data.

Anyone not backing up their system is really asking for trouble... which will
happen. Given the ease of use of modern backup systems and their cost (free,
costing only a modest amount of your time), everyone should be doing it. OS X
and Win7 make you feel guilty for not doing it (though OS X's version is
better, both deliver on basic backups).

All this said, the difference between an SSD and an HD is about zero when it
comes to real reliability. Both will fail at odd times, and you should have a
backup, preferably bootable, to get you back to good. An external drive with a
system-imaged startup disk (free for all major desktop OSs) is quite cheap to
maintain.

~~~
rue
> _Consider this: the hard drive is the least reliable component of almost any
> modern PC/laptop._

Is it really? Completely anecdotally I've _never_ lost a HDD, but keep going
through power supplies… it'd be nice to see some data on component failure
rates. I assume some exists, but I haven't really felt strongly enough about
it to go look.

~~~
drzaiusapelord
I've worked in IT for several years. My own failure rate is one or two drives
per year in a 50-computer environment, so in a 150-computer environment it's 3
to 6 a year. That's desktops/laptops.

In servers it's a completely different game. Thanks to A/C and steady loads,
it's much, much better. There is a chance of getting a run of bad disks and
suddenly having multiple failures a year, but only on that specific model.
Generally, I'd say the rate is closer to 0.25-0.5 failures for every 50 drives
per year, if that. So over 4 years I can expect one or two drive failures on
50 disks.

Regardless, drives fail all the time on desktops and laptops. The reliability
is a huge, huge problem. Supposedly, SSDs were going to fix this, but their
teething problems are probably making them worse than spinning disks.

------
JoeAltmaier
tl;dr: SSD failure rates are no better than most hard drives'. Intel's failure
rates only count validated errors (returned drive fails in Intel's test), so
the real rate is probably 2-3x higher. SMART doesn't work for SLC SSDs (it
doesn't detect failures early enough to recover). Update your firmware often,
as early failures were mostly bugs and not write failures.

------
kalleboo
All these statistics come from server use, where drives are constantly
spinning and kept at a relatively constant temperature. I'm curious if there's
a bigger difference in laptop computers, where the drives see a lot more
physical abuse, power cycles, temperature variations, etc.

~~~
mawelsh
I haven't quite been able to grasp why Tom's Hardware, which I view largely as
a consumer/enthusiast review site, would choose to analyze the failure rates
of SSDs in a datacenter environment. How about going to IT departments that
have "floater" laptops and seeing how long the drives hold up in those? I
think the primary consumer reasons for purchasing SSDs are increased battery
life and physical fault tolerance in laptops, not the expectation that they'll
outlast a standard HDD in a server setting.

~~~
dspillett
_> I haven't quite been able to grasp why Tom's Hardware, which I view largely
as a consumer/enthusiast review site, would choose to analyze the failure
rates of SSDs in a datacenter environment._

It could just be that there is not enough good data from the consumer market
to draw solid conclusions from. DCs use drives in large numbers, so you are
going to get "concentrated" readings.

------
mrb
Why is the IT industry so cautious about SSD reliability? We have spent
decades developing HDD fault-tolerance mechanisms or processes such as RAID
and backups. _We should trust them_.

~~~
viraptor
We're currently trusting a drive that might fail in 5 or so years, in one
specific place. The rest is recoverable, the place of failure might be
overwritten anyway, the data might not be needed, SMART will warn you that the
problem is close in many cases...

With an SSD, you have a chance of going completely blank. If it happens to be
a firmware issue on a RAID mirror, there's also a chance of a common fault in
both drives. Even if you have backups in that case, do you really want to deal
with such a situation?

~~~
rbanffy
> SMART will warn you that the problem is close in many cases...

According to Google's drive study, SMART will warn you only about two-thirds
of the time.

> there's also a chance of common fault in both drives.

Never, ever trust a RAID array made from identical disks. Whenever possible,
use different manufacturers, different models, and different batches. Whatever
caused the failure of one drive will eventually cause the failure of its
twins. If all the twins are in the same array, you won't be happy.

------
iamelgringo
I talked with Greg Lindahl, the CTO of Blekko, about their infrastructure. He
came into the office one day with 700 SSDs and said, "Here's our new storage
back end." Their search index is stored entirely on SSDs.

They haven't had a single SSD failure since. Granted, their search index is
read-only.

------
ck2
I remember reading, early on when SSDs first came out, claims that when an SSD
fails, it fails into a read-only state, so at least you do not lose your data.

But apparently this is not true; it's not how SSDs fail at all.

What's crazy is that I have not had a hard drive fail since we passed the
triple-digit mark for GB capacity. The last one was a 20GB drive (ah, the old
days).

~~~
Symmetry
That's what happens when the flash memory wears out, but really SSDs haven't
been around long enough for anyone but the most intense users to wear out the
flash memory of one. Instead the problems people have run into are in other
parts of the SSD - mostly firmware.

------
smackfu
Reminds me of CFLs. In theory, they are much more reliable than incandescent
bulbs. In practice, they often stop working about as soon as the old light
bulbs would, and they cost significantly more.

------
dlsspy
I wish people would compare these in terms of IO operations instead of wall
time. SSDs are a huge win there, and that matters the most for a lot of
people.

~~~
abduhl
This is the main conclusion at the end of the article: SSDs should be picked
for their performance rather than their reliability, as SSD reliability hasn't
been proven to be better, whereas performance has. Reliability comes into it
because you can replace 4+ HDDs with a single SSD, and thus reduce
energy/heating costs and also increase reliability, since you only have 1
drive that can fail vs. 4+.

~~~
rbanffy
You should also consider that the reliability of a drive decreases at the same
rate that the data it holds grows more important. The more important the data,
the less you should trust your disks (and the better your backups, mirroring,
and checksumming should be).

------
davidw
Be careful about your answer, or the avenging angel Murphy will smite thy SSD
or Hard Drive down.

