
SSD Failures in Datacenters - sharva
http://dl.acm.org/citation.cfm?doid=2928275.2928278
======
spudlyo
I worked at a company that migrated to SSDs on hundreds of HP ProLiant
servers. The performance was great; it really improved our DB latency.
Unfortunately, the RAID 1+0 setup carried over from the spinning-rust days
didn't work for our most common failure case: corrupted writes.

The HP RAID controller failed to detect corrupted writes on what I seem to
remember were Intel SSDs. The only way we learned about failed SSDs was when
we started catching large numbers of DB page checksum errors, and by that
time it was too late to do anything about it.

Operationally it sucked; swapping out drives is a lot easier maintenance than
having to rebuild a DB from backup.

~~~
amelius
So based on this, should filesystems include a "verify-writes" mount option?

Or would this functionality somehow be counteracted by a cache inside the SSD?

Perhaps filesystems should include a "ssd" mount option, because there might
be more things to worry about (for example frequent writes wearing out the
device).
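
To make the idea concrete, here's roughly what I mean by verify-after-write,
done in userspace rather than as a mount option (just a sketch; Python on
Linux, and the path and buffer sizes are made up). Dropping the page cache
with posix_fadvise forces the re-read to go back to the device, but the SSD's
own cache could still answer it, which is my second question above.

    import hashlib
    import os

    def write_and_verify(path, data):
        # Write, force it out to the device, then re-read and compare digests.
        digest = hashlib.sha256(data).digest()

        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)  # flush OS buffers to the drive
        finally:
            os.close(fd)

        fd = os.open(path, os.O_RDONLY)
        try:
            # Ask the kernel to drop its cached pages so the read hits the device.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            chunks = []
            while True:
                chunk = os.read(fd, 1 << 20)
                if not chunk:
                    break
                chunks.append(chunk)
        finally:
            os.close(fd)

        return hashlib.sha256(b"".join(chunks)).digest() == digest

Even if the re-read matches, the drive's internal cache/FTL could still lose
the data later, so a check like this is necessary but not sufficient;
end-to-end checksums like the DB page checksums mentioned upthread are what
ultimately catch it.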

~~~
rsync
"So based on this, should filesystems include a "verify-writes" mount option?"

They do. You can, for instance, mount a UFS filesystem with "sync", which
means that all writes are synchronous. I assume extX has a similar option ...

~~~
amelius
Afaik, "sync" means that the writes are immediate. "Verify-writes" would mean
that the writes are not necessarily immedate, but that they are verified.

Also be careful with sync on SSD media; this is what the mount manpage says
about the sync option:

> In case of media with limited number of write cycles (e.g. some flash
> drives) "sync" may cause life-cycle shortening.

------
kyrra
For those interested, Google published a similar report a few months back.

[https://www.usenix.org/conference/fast16/technical-sessions/...](https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder)

[https://news.ycombinator.com/item?id=11188445](https://news.ycombinator.com/item?id=11188445)

------
RUBwkVjwLsDKgPw
All or most of the SSDs used in this study were used as a cache in front of
HDDs, so take the results with a grain of salt.

------
mikevm
Another interesting paper on the issue of using parity RAID with SSDs:
[http://pages.cs.wisc.edu/~kadav/new/pdfs/diffraid-hs09.pdf](http://pages.cs.wisc.edu/~kadav/new/pdfs/diffraid-hs09.pdf)

------
khc
"An error occurred while processing your request.

Reference #50.d66d717.1467850467.45a102d"

I suppose their SSD failed

~~~
sua_mae
Akamai failed

------
sharva
[http://dl.acm.org/citation.cfm?doid=2928275.2928278](http://dl.acm.org/citation.cfm?doid=2928275.2928278)

------
pcunite
Can someone summarize the results? Should I continue using my SSD, or no?

:-)

~~~
Dylan16807
A typical annual failure rate near but under 1%, so use it like any other
drive: keep regular backups and expect to need them at some point.
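
Back-of-the-envelope, assuming a flat 1% AFR and independent failures (both
simplifications, not numbers from the paper):

    afr = 0.01  # assumed ~1% annual failure rate
    for fleet in (10, 100, 1000):
        expected = afr * fleet
        p_any = 1 - (1 - afr) ** fleet  # chance of at least one failure in a year
        print(f"{fleet:4d} drives: ~{expected:4.1f} expected failures/yr, "
              f"P(>=1) = {p_any:.0%}")

A single drive will probably outlive the machine it's in, but across even a
modest fleet you should plan on exercising those backups.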

------
more_corn
I did a large production SSD deployment a few years ago on a db fleet backing
a large video hosting site (probably the first one you think of).

tl;dr: the benefits were astounding. The failure rates were super low. We
saved something like 2 million dollars by significantly increasing the I/O
performance of the fleet, which prevented the need to re-shard.

Choosing a drive: We went with Intel MLC (consumer-class) drives. No other
drive had such low DOA rates and as good a match between performance and
price. We set the max LBA to 80% of available capacity (actually recommended
by Intel). This change eliminated the pathological case where a full drive
continually overwrites the same sectors. The consumer drives suddenly had a
lifespan comparable to enterprise drives (though a slightly slower read speed,
which was fine because we needed balanced reads and writes). We also exported
and monitored the wear-leveling stats (available via a SMART value) so we
wouldn't run into the case where they all unexpectedly wear down at the same
time. The projected lifespan looked really good and in most cases exceeded
that of the servers. (In retrospect this turned out to be true.)
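
For anyone wanting to do the same, here's a rough sketch of the kind of SMART
wear export I mean (not our actual tooling; attribute names vary by vendor,
and I'm assuming the Intel-style Media_Wearout_Indicator, which counts down
from 100). The 80% max-LBA cap itself is a separate step, usually done with
hdparm's HPA setting or the vendor tool.

    import subprocess

    # Attribute names differ per vendor; these are common wear indicators.
    WEAR_ATTRS = ("Media_Wearout_Indicator", "Wear_Leveling_Count")

    def wear_remaining(device):
        # `smartctl -A` prints the SMART attribute table; the normalized
        # VALUE column typically starts at 100 and counts down toward 0.
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 4 and fields[1] in WEAR_ATTRS:
                return int(fields[3])
        return None

    if __name__ == "__main__":
        left = wear_remaining("/dev/sda")
        if left is not None and left < 20:
            print(f"warning: only ~{left}% of rated media life left")

Export that number to whatever graphing/alerting you already have and you can
watch the whole fleet's wear trend in one place, which is how you avoid the
everything-wears-out-on-the-same-day surprise.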

RAID config: We ran RAID 0 (because I love to live on the edge; no, actually
because we split the boot, data, and binlog volumes, and data integrity was
preserved by redundancy within the fleet). The conversation I had with the
DBAs about this was one of the most eye-opening conversations of my career.
Turns out that if a replica failed or lagged too much, they simply tossed the
data and started a new one. They didn't actually need or want RAID 10, and
that changes the economics of the picture significantly.

This would have been 4 years ago now. We deployed several thousand drives. Of
those, only one ended up with an abnormally high wear level. (I proactively
threw it out when we decommissioned the hardware to prevent a problem for the
next guy.) I had a handful of drives that didn't pass the pre-service checks
and another handful that failed in production.

Compare that to 10k spinners, which appear to have an annual failure rate of 1
in 20.

SSD failure profile: We would typically see a drive failing to respond
properly (announcing a crazy size, not showing up at all, failing to save its
max LBA). Most of this happened in the pre-production check phase.
Post-deployment failures you could count on one hand (I have the normal number
of fingers ;-)

Qual process: We qualified 6-8 different manufacturers, from high-end
enterprise to the dirt-cheap vendor with a three-letter name now synonymous
with garbage (I won't even say the name here). The cheap drives had high DOA
rates and an unsettling tendency to spontaneously reset, causing a
several-second pause (not so good for our uses, but it might be tolerable in
other circumstances).

------
smartbit
slides
[http://www.systor.org/2016/slides/ssd_failures_systor2016.pd...](http://www.systor.org/2016/slides/ssd_failures_systor2016.pdf)

------
kyledrake
Summary?

~~~
SFJulie
Compared to HDDs, SMART yields quite a lot of false negatives/positives, so
detecting failures is proving harder than it is with HDDs.

Failure rates don't seem to fit the models and are a tad higher than expected.

Most SSDs are reliable; the only problem is that they fail in ways that are
tough to detect, leading to consequences we may overlook (SSDs that falsely
appear to be functioning in operation, while we experience silent corruption).

Given the "too many knobs" of SSDs, they can fail with unpredictable,
snowballing effects (chaotic and dramatic).

Basically, this study confirms that the IEEE standards for rating MTBF need to
be reconsidered drastically, that our failure models for SSDs are far from
well understood, and that SSDs are in production while all the costs are not
yet known. The cost of SSD-specific failures is not yet known, and the authors
urge SSD makers to begin studying the effects of their "knobs" on reliability.

Even though it is written in a very neutral, scientific tone that tries not to
scare people, it can basically be used as strong evidence to ban SSDs from
critical systems.

It is a heavy blow to the SSD industry, since it attacks the cost model that
is based on expected reliability and says that these drives are still unknown
territory when it comes to failures.

~~~
toyg
You start from the common but misguided assumption that "unknowns" are an
absolute evil, whereas they are just another risk factor to weigh. You offset
the costs of worst-case scenario readiness with the productivity and financial
gains of using the tech.

If SSDs make your business 2x as productive or 2x as remunerative, the cost of
an extra backup or two is simply a blip.

 _> ban SSD from critical systems._

That's an over-reaction if I've ever seen one. I guess it depends what a
"critical system" is for you. If it's something that has to guarantee high
performance under massive load, SSDs are what you _need_, period. If it's
something that has to guarantee 100-year data storage under a mountain, then
there might be better choices, but they might not involve spinning drives
either.

~~~
wtbob
> You start from the common but misguided assumption that "unknowns" are an
> absolute evil, whereas they are just another risk factor to weigh.

> > ban SSD from critical systems.

> That's an over-reaction if I've ever seen one.

No, it's really not. SFJulie's point is that we don't know enough about the
way SSDs fail to build a good failure model for them, and that we currently
don't seem to have a good way to even detect that they are failing. If you
can't build a failure model for something, you can't build a good risk model
including it. That's kinda her point.

FWIW, I don't know that I fully agree with that point: it seems to me that
cryptographic integrity checks and backups would suffice, but … maybe not?

> If it's something that has to guarantee high performance under massive load,
> SSDs are what you need, period.

We seemed to do alright four years ago, no? SSDs are not a requirement: one
can get equivalent system performance, with predictable failures, by spending
more money on more hard drives and CPUs to talk to them.

~~~
kabdib
> SSDs are not a requirement: one can get equivalent system performance, with
> predictable failures, by spending more money on more hard drives and CPUs to
> talk to them.

Lots more money. Many, many more hard drives, with failure rates to match. And
you're still not going to get there.

We started running a 240TB all-SSD array a few months ago and we get
sub-millisecond latencies on most reads and writes. When we see an excursion
to 10ms, that's cause for concern; at 50ms we do a root-cause. We can peg
multiple 16Gbit Fibre Channel links and it's okay. And it fits in half a rack;
try that with disk.

I cannot imagine doing this with spinning rust.

------
chmaynard
Bad URL?

~~~
sharva
Seems to be working for me; anyway, this is the URL:

[http://delivery.acm.org/10.1145/2930000/2928278/a7-narayanan...](http://delivery.acm.org/10.1145/2930000/2928278/a7-narayanan.pdf)

~~~
zbuttram
That link seems equally dead. You might have it in cache still?

~~~
sharva
Ah, it might be because my institution has the ACM subscription. This is the
paper:
[http://dl.acm.org/citation.cfm?doid=2928275.2928278](http://dl.acm.org/citation.cfm?doid=2928275.2928278).

~~~
sctb
Thanks, we updated the submission link.

