
How Reliable Are SSDs? - ingve
https://www.backblaze.com/blog/how-reliable-are-ssds/
======
cbg0
I was really hoping this would actually contain some failure rates. Compared
to this blog post, the Wikipedia page for SSDs
([https://en.wikipedia.org/wiki/Solid-state_drive](https://en.wikipedia.org/wiki/Solid-state_drive))
is more informative overall, and also doesn't advertise any cloud backup
products.

~~~
rodbauer
Roderick here from Backblaze. This post is not based on our experience with
SSDs in our data centers; it is a general article about SSD reliability. At
some point in the future we could have enough experience and enough data on
SSDs to write about how they perform in our own applications. We blog on a
variety of topics that are aimed at different audiences with different levels
of experience. The Drive Stats series is just one of those series, which we
will be continuing.

~~~
time0ut
I admit I felt some disappointment when I reached the end and there was no
stats table. That said, I don't regret reading it. It was a well written post.
Thank you for writing it.

------
true_tuna
I used SSDs in a large datacenter database deployment some five years ago:
Intel drives with SandForce controllers. Of the thousands of drives I
installed, I had a handful of DOA units, another handful of failures when
writing max LBA to overprovision by 20% and to export the wear SMART value,
and after a year or two exactly one drive that needed to be pulled because it
reported it was at 70% of its write lifespan. Compared to the 10k RPM SAS
drives that had an annual failure rate of 1 in 20, I counted it a win.
Especially because I was getting 4x the performance out of 4 drives that I
used to get from 12.

Then again, there was a huge difference in failure rates between
manufacturers who did SSDs well (Intel and SanDisk) and manufacturers who did
things... less well. I remember the OCZ controller would crash and block
write operations for a half second while it rebooted. Kinda messed with the
write throughput numbers.
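Checking the wear SMART value the commenter mentions is typically done by parsing `smartctl -A` output. A minimal sketch, assuming Intel's `Media_Wearout_Indicator` attribute (the attribute name and ID vary by vendor, so treat this as illustrative only):

```python
def parse_wearout(smartctl_output):
    """Extract the normalized Media_Wearout_Indicator value (100 = new,
    lower = more worn) from `smartctl -A` text output. Returns None if
    the attribute is absent. Attribute name/ID vary by vendor."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # SMART attribute rows look like: ID# ATTRIBUTE_NAME FLAG VALUE WORST ...
        if len(fields) >= 4 and fields[1] == "Media_Wearout_Indicator":
            return int(fields[3])  # normalized VALUE column
    return None
```

A value of 70 would correspond to the "70% of write lifespan remaining" reading described above; pulling drives at some threshold of this value is a common preemptive-replacement policy.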

------
toomuchtodo
@backblaze: Have you done the math to find the intersection between SSD
capacity/cost and the TCO of power and cooling (versus spinning disk) over
the life of the drives? Also, would write count be a critical measure for
your use case? I would assume not, as backed-up bits are mostly static (files
split into blocks, blocks saved indefinitely) except when you're performing
garbage collection of user data when they leave or have removed a file
locally.

There's something majestic about a datacenter with no moving parts except
fans.
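A back-of-envelope sketch of that capacity/power/cooling intersection; every number here (prices, wattage, electricity rate, cooling overhead) is a hypothetical placeholder, not a Backblaze figure:

```python
def tco(price_per_tb, watts_per_tb, years, kwh_cost=0.10, cooling_overhead=1.5):
    """Total cost of ownership per TB: purchase price plus electricity,
    with cooling modeled as a PUE-like multiplier on drive power.
    All default figures are hypothetical placeholders."""
    hours = 24 * 365 * years
    energy_kwh = watts_per_tb * hours / 1000 * cooling_overhead
    return price_per_tb + energy_kwh * kwh_cost

# Hypothetical comparison: cheap, power-hungry HDD vs. pricey, frugal SSD.
hdd = tco(price_per_tb=25, watts_per_tb=1.5, years=5)
ssd = tco(price_per_tb=100, watts_per_tb=0.5, years=5)
print(f"HDD: ${hdd:.2f}/TB  SSD: ${ssd:.2f}/TB over 5 years")
```

With these made-up inputs the SSD's power savings nowhere near close the capex gap; the interesting question is at what $/TB spread the curves cross.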

~~~
rodbauer
Roderick from Backblaze. All of that is of great interest to us and I can
assure you that we are monitoring the TCO of different approaches to data
storage. Due to the very high number of drives we purchase, we have close
relationships with drive manufacturers, who make sure we know about the latest
tech. FYI, they all closely read our Drive Stats posts.

------
MrStonedOne
This is a weird article. It seems targeted at home users, but variance, not
the average, is the primary concern the average home user should have with
regard to 'reliability' once you get past checking whether the lifetime is
even long enough, and they don't even address that. It's not "how long does
the average SSD last", it's "how likely is my SSD to die before its time".

Step 1: Assume every home user doesn't use backup. Most users have at least
one important thing on their hard drive that is not getting backed up right
now, either because they don't have a backup solution, or because the thing is
located in some non-standard location that is not getting backed up. I can
think of one thing on _my own computer_ that qualifies for the latter case
right now.

Step 2: Now write the article taking step 1 into account. Assume somebody
reading this article is going to use it to decide where to store their
unbacked up important data.

Nobody cares how long the average case is when it's longer than they plan to
use the drive.

They care about the likelihood their unbacked up data will be lost because
their ssd died before its time.

They care about how well the firmware is designed to handle error cases: does
it shut down and refuse to power up, or does it go read-only? Do minor
localized errors in some data cause it to lock out access to all data?

Those are the interesting questions with regard to SSDs.

------
dvfjsdhgfv
> If you replace your computer every three years, as most users do

If everybody shares that opinion, no wonder software is getting more and more
bloated, slow, and unresponsive. My main work laptop is an i3 from 2010. I
just replace the hard drives and battery every couple of years. Why on earth
should I replace the whole machine every 3 years? If everyone did that, it
would be terribly wasteful.

~~~
meko
I come across that attitude often. I was mocked for pointing out Apex Legends
wasn't running very well on a $2200 Sager machine from 2013. "Lol you idiot!!!
How can you possibly expect a new game to run on a 5 year old machine!"

------
berbec
It seems BB listens here!

[https://news.ycombinator.com/item?id=19174145](https://news.ycombinator.com/item?id=19174145)

~~~
atYevP
Yev from Backblaze here -> we do! Though granted this post was general and not
based on our own horde of SSDs (which we haven't deployed) - but the topic was
interesting!

------
londons_explore
Can any SSD controller design engineers comment here...?

What is the primary cause of SSD failures?

Is it flash wear out of all cells, leaving no good cells to write new data as
people talk about?

Is it flash wear out only of some critical cells?

Is it that cells degrade over time while the drive is off, and when next
powered up, they are too far gone to recover the data?

Is it unrecoverable corruption of critical internal data structures?

Is it unrecoverable corruption of user data (i.e. the user could reformat the
drive and have it back in a usable state)?

Is it hardware failure outside the typical 'flash wear out' model?

Is it firmware bugs (e.g. a badly timed power-off leaves data in such a state
that the firmware can't initialize next time)?

------
mohaba
Need another run of Tech Report's "SSD Endurance Experiment" with current
drives.

Throw some NVMe and Optane in there, see if we can wear one of those out.

------
PeterCorless
I personally would have preferred to see some actual MTBF or MTTF data.
([https://blog.fosketts.net/2011/07/06/defining-failure-mttr-mttf-mtbf/](https://blog.fosketts.net/2011/07/06/defining-failure-mttr-mttf-mtbf/))

------
trainingaccount
Yet another article with a focus on gradual wearing rather than suddenly
turning into a brick, which is the actual common real-world case.

------
ggm
I would ask if there is a recursive quality emerging: we put RAM in front of
the rust as cache to make it faster, and possibly improve reliability through
fewer random accesses and more full-block streaming.

So now we put SSDs in front of entire disk systems; do we get the same
benefit? Does a rust RAID run more reliably if an SSD acts as frontline
storage in some write-through model?
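The write-through model in question can be sketched as a toy: every write goes to both the fast front tier (SSD stand-in) and the slow backing store (rust RAID stand-in), so the cache never holds the only copy and a cache failure costs performance, not data:

```python
class WriteThroughCache:
    """Toy sketch of a write-through cache: the fast tier accelerates
    reads, while the backing store is always kept current."""

    def __init__(self, backing):
        self.backing = backing  # slow store: any dict-like (rust RAID stand-in)
        self.front = {}         # fast tier stand-in (SSD)

    def write(self, key, block):
        self.front[key] = block
        self.backing[key] = block  # write-through: backing is always current

    def read(self, key):
        if key in self.front:       # cache hit: served from the fast tier
            return self.front[key]
        block = self.backing[key]   # miss: fetch from backing, then populate
        self.front[key] = block
        return block
```

The reliability argument would be that the backing disks see fewer small random writes (absorbed and coalesced by the front tier), which is exactly the RAM-in-front-of-rust benefit restated one level up.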

------
sadris
That was a frustrating read. I kept scrolling, waiting for a table of failure
rates like in their other posts. But it never came.

------
camel_gopher
I did some work with eBPF measuring block I/O latencies on SSDs a few weeks
ago. I wish I had a testbench at the scale Backblaze has to really get some
data at scale.

[https://www.circonus.com/2019/01/which-block-i-o-scheduler-is-the-best-we-asked-ebpf/](https://www.circonus.com/2019/01/which-block-i-o-scheduler-is-the-best-we-asked-ebpf/)

------
gaspoweredcat
That kind of failed to answer its own question. I was expecting some sort of
data on manufacturer/model/how long in service before failure, etc., similar
to things I've seen for platter drives.

