
Why I'm usually unnerved when modern SSDs die on us - stargrave
https://utcc.utoronto.ca/~cks/space/blog/tech/SSDDeathDisturbing
======
pkaye
I worked on SSD firmware for quite a long time and here is my perspective.

Early flash used to be fairly reliable with only minimal error correction.
However, with increasing density, smaller processes, and multi-level cells, it
has grown progressively less reliable and slower. Here are some of the things
that we need to worry about:
[https://www.flashmemorysummit.com/English/Collaterals/Procee...](https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2016/20160808_PreConfH_Parnell.pdf)

To compensate for all these deficiencies, the SSD architecture and hence the
entire FTL becomes very complicated, because any part of it can become damaged
at any time. We always have to have backup algorithms to recover from any
scenario. It's difficult to build algorithms that can recover from arbitrary
failures in a reasonable time. I cannot have a drive sitting around for 20
minutes trying to fsck itself.
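
To make the complexity concrete, here is a toy sketch (Python, purely illustrative - real firmware looks nothing like this) of the core FTL idea: NAND can't be overwritten in place, so every logical write lands on a fresh physical page and a logical-to-physical map tracks where data currently lives. That map, along with erase counts and bad-block lists, itself has to live in flash and be rebuilt after power loss, which is why damaged metadata can brick an otherwise healthy drive.

```python
# Toy flash translation layer (FTL) - illustrative only, not real firmware.
# Flash pages can't be overwritten in place, so every logical write goes to a
# fresh physical page and the logical->physical map is updated.  The map (plus
# wear counters, bad-block lists, ...) must itself be persisted to flash and
# reconstructed on power-up; corrupt that metadata and the drive is gone even
# though the user data is physically intact.

class ToyFTL:
    def __init__(self, num_pages: int):
        self.flash = [None] * num_pages   # physical pages
        self.l2p = {}                     # logical page number -> physical page number
        self.next_free = 0                # naive allocator: no GC, no wear leveling

    def write(self, lpn: int, data: bytes) -> None:
        if self.next_free >= len(self.flash):
            raise RuntimeError("out of free pages (a real FTL would garbage-collect here)")
        ppn = self.next_free
        self.next_free += 1
        self.flash[ppn] = data
        self.l2p[lpn] = ppn               # the old physical page becomes stale

    def read(self, lpn: int) -> bytes:
        return self.flash[self.l2p[lpn]]

ftl = ToyFTL(num_pages=8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")                       # same logical page, new physical page
assert ftl.read(0) == b"v2"
```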

Another problem is that the job, while rewarding, is not very lucrative. The
chance of a multi-million-dollar payoff for an employee is low. I have a
higher chance of becoming a millionaire working on a web-connected gadget. So
it is really hard to recruit top-notch programmers who know how to figure out
the algorithms, write the code, and debug the hardware. Most new grads these
days are interested in Python, JavaScript, and machine learning.

~~~
loeg
Users and administrators almost certainly prefer a 20-minute IO latency over
data corruption. Host operating systems should probably flag an IO as failed
long before 20 minutes; then you _know_ 1) nothing made it to disk, and 2) you
have some chance of avoiding additional corruption, if, e.g., the OS is smart
enough to kick out the drive when this happens.
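
For what it's worth, Linux already exposes a per-device command timeout for SCSI/SATA drives through sysfs (30 seconds by default), so the host side is at least partly tunable today. A rough sketch, assuming the drive shows up as /dev/sda; lowering the value too far will turn merely slow IO into spurious errors:

```python
# Rough sketch: inspect and tighten the Linux block-layer command timeout for
# a SCSI/SATA device.  The sysfs file below exists for "sd" devices; adjust
# the device name for your system.  Writing it requires root.
from pathlib import Path

timeout_file = Path("/sys/block/sda/device/timeout")   # assumption: drive is sda

current = int(timeout_file.read_text())
print(f"current command timeout: {current} s")

# Commands taking longer than this are aborted by the kernel and retried or
# failed up the stack, rather than hanging indefinitely.
timeout_file.write_text("10\n")
```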

> Another problem is that the job while rewarding is not very lucrative.

Do you mean it's lower paying than typical bigcorp software jobs outside of
FAANG, or just that there aren't a lot of startups with astronomical
valuations in the media FTL space?

~~~
xyzzy_plugh
Both, and usually significantly.

Last I checked, it was nearly twice as lucrative to be a Ruby on Rails
developer as an embedded engineer.

Embedded also attracts a certain type of engineer, usually very smart and able
to manage extreme complexity with attention to detail but at the cost of
anything resembling readable, let alone maintainable, software. The fact that
anything at all works in the modern world is amazing.

I left the embedded space and have never looked back.

~~~
howard941
So $SALT_MINE, a highly profitable, privately held Fortune 500, just decided
to revamp their pay scales. They've now decided that they want to be at the
50th percentile, remuneration-wise, in the durable goods sector. That is the
white goods sector, washing machines and such.

I predict we will lose all of our engineers - embedded dudes included.

~~~
chrisbennet
Yeah, but their numbers will look good for a year or two, and by then they
(CEO, etc.) will have moved on to another company before things crash.

------
niftich
Not that spinning HDDs are really any different, but SSDs are a perfect
example of an entire computer that you attach to yours and speak with through
one of the (many) storage-oriented protocols. The device itself is a black
box, and complex transformations take place between the physical persistence
of the data and the logical structures that are exchanged on the wire. There
are many layers of indirection, and many things that can go wrong, from a
fault in the underlying physical storage, to a physical fault in the
controller, to a logical (software) condition in the controller that puts it
in an unrecoverable state.

Spinning-platter drives have parts that form a more relatable metaphor for
humans' notions of wear and tear: skates of magnetic readers flying on a
cushion of air above a rapidly rotating disc, with a gap of a few dozen
nanometers, often smaller than the process size in the controller's silicon.
They have arms that can move the head over a particular disc radius, and a
motor that spins the entire stack of platters. These mechanical components
exhibit wear proportional to their use -- this makes intuitive sense, and is
also recorded in the SMART attributes, so drives of advanced age or with many
park cycles can be replaced preemptively before they catastrophically fail.

SSDs are missing many of the usual mechanisms that would contribute to
physical wear leading to sudden catastrophic failure in advanced age. This
means that, irrespective of their failure rate vs. HDDs, a higher proportion
of their catastrophic failures are the fault of the controller. This is
discouraging: essentially, the "storage layer" is now quite reliable, so the
fallibility of the human-programmed controller is brought to light.

~~~
cyphar
> skates of magnetic readers flying on a cushion of air above a rapidly
> rotating disc, with the gap separating a few dozen nanometers, often smaller
> than the process size in the controller's silicon.

Complete aside: the fly-height of a magnetic head is actually fractions of a
nanometer (i.e. hundreds of picometers).

EDIT: I got this from a talk by Bryan Cantrill[1]. The fly-height is allegedly
0.8 nanometers (800 picometers).

[1]:
[https://youtu.be/fE2KDzZaxvE?t=1551](https://youtu.be/fE2KDzZaxvE?t=1551)

~~~
hwillis
I do not believe that.

1\. I can't find a source that says less than a few nanometers.

2\. 300 picometers is roughly the diameter of a helium atom. The head cannot
possibly float through hydrodynamic means if an air molecule can barely even
fit under it.

~~~
blattimwind
> The head cannot possibly float through hydrodynamic means if an air molecule
> can barely even fit under it.

It can. Since sibling comments liked airplane analogies, here is another one:
consider the head to be an airplane. It has somewhat wing-like features which
provide a lifting force, but the actual read/write head sits below those
features (like, say, the landing gear sits below the wings).

~~~
Serow225
This is correct :)

------
Waterluvian
"When a HD died early, you could also imagine undetected manufacturing flaws
that finally gave way. With SSDs, at least in theory that shouldn't happen"

Why shouldn't it? Isn't it just hardware too?

"With spinning HDs, drives might die abruptly but you could at least construct
narratives about what could have happened to do that"

Why can't you do the same with SSDs?

It feels like the author's main complaint is the frustration of not
understanding SSD hardware as well.

Is this a valid complaint? Are SSDs magical in some way? I'm not an expert
but... It's just hardware with pieces that do stuff. Why can't we come up with
an understanding of why it fails?

~~~
rsync
"It feels like the author's main complaint is the frustration of not
understanding SSD hardware as well."

What is so frustrating about SSDs is how very poorly they compare to previous
incarnations of solid state storage.

Using Disk-On-Chip and/or IDE-pin-compatible CF cards, I had many, many
devices in the field that lasted, mounted read-only, for _decades_. An entire
sector of the computing industry came to rely on these parts as alternatives
to spinning media that could not mechanically fail.

This is not the case with SSDs at all. They fail left and right, even mounted
read-only, for all manner of complicated and _interesting_ reasons. It's very
frustrating that SSDs are not a step forward in reliability from spinning
media and are _a step downward_ compared to (for instance) a 16MB consumer CF
card from Sandisk, circa 2000.

rsync.net filers, which need a boot mirror, are always constructed with two
_unrelated_ SSDs - usually one Intel part and one Samsung part - so that when
the inevitable usage-related failure occurs, _it does not occur simultaneously
to both members of the mirror_ which have, being a mirror, been subjected to
identical usage-lives.[1]

We shouldn't have to do that.

[1] I can't overstate this - if you need a RAID _mirror_, do not use
identical SSDs for the two members of the mirror. There are many, many cases
of SSDs failing not due to "wear" or end-of-life, but due to weird usage edge
cases that cause them to puke ... and in a mirror, you give the two parts
identical usage ... we either get two different generations of Intel part
(current gen and just-previous gen) or we get current Intel and current
Samsung ...

~~~
Mister_Snuggles
I've heard very similar advice for non-SSD mirrors too. Use different
manufacturers or, at the very least, use different batches of disks from the
same manufacturer.

~~~
blattimwind
Most importantly, never assemble an array from drives that took the same
shipping path, i.e. came out of the same parcel.
------
nneonneo
A major problem with SSDs seems to be “firmware death” - where the flash chips
are physically fine (or mostly fine), but the firmware (or firmware memory)
has gotten corrupted due to some programming error, electrical glitch, or
cosmic ray. I’ve had scores of older SSDs die after things like power outages
and sudden shutdown events. This is super frustrating because the data is
physically OK but the controller just isn’t responding to any requests
anymore.

I wonder if there’s an easy way to distinguish a controller failure from a
flash failure from the behavior of the device over the last few
seconds/minutes of operation. In theory a controller failure should cause a
fairly abrupt loss of service, but I’m sure there are soft lockup failure
modes too.

~~~
sebazzz
I have seen some weird issues with SSDs. I had an OCZ Vertex 2 die on me
multiple times, but the thing that stood out most is that after a power cycle
or complete system shutdown (note: reboots were just fine), everything I had
done in the previous session - installing software, updating Windows, creating
files - was gone. The state was reverted to before the previous boot. It was
like my computer contained some kind of Reborn chip, except it was the
SandForce controller malfunctioning.

~~~
ricardobeat
Modern SSDs can have very large DRAM write buffers (8-256 MB); this is pretty
plausible if the drive was failing to flush it.

~~~
Skunkleton
It was probably some kind of wear-leveling metadata that didn't get flushed.

------
lisper
This is not a technological problem, it's a cultural one. These problems are
easily fixed ("easily" by the standards of technical problems that regularly
get fixed in other regimes). The reason they don't get fixed is that the
customer reaction to failures like this is to rant at the mysterious storage
gods that are making their lives miserable.

Needless to say, there are no mysterious storage gods. These are artifacts
made by humans, and somewhere out there, there is an engineer who either
understands why these failures are happening, or knows how to engineer these
devices in such a way that when these failures happen, the cause can be
determined, and then a design iteration can be done to reduce the failure rate
and make the failure modes more robust. The reason this doesn't happen is that
customers aren't demanding it. If major purchasers started demanding,
essentially, an SLA from their SSD manufacturers, with actual financial
consequences for violating it, you would be amazed how fast all of these
problems would get fixed. But instead we vent our frustrations in blog posts
and HN comments :-(

~~~
tzs
> Needless to say, there are no mysterious storage gods

What about Consus, the god who protected grain storage in the ancient Roman
religion? [1]. Or Eopsin, the Korean goddess of storage? [2]

[1]
[https://en.wikipedia.org/wiki/Consus](https://en.wikipedia.org/wiki/Consus)

[2]
[https://en.wikipedia.org/wiki/Eopsin](https://en.wikipedia.org/wiki/Eopsin)

~~~
rabidrat
Those gods are not so mysterious, are they.

------
shittyadmin
I've experienced a few seriously strange issues with modern SSDs, even some of
the better ones.

I had a 512GB Samsung drive that would randomly become very slow at IO
operations; the whole machine would die for 10-30 seconds at a time, once or
twice a day, while any process that tried to use the disk became blocked on
IO. Then it'd come right back like everything was perfectly fine.

Issues like this definitely worry me, we're basically completely blind as to
what those controllers and flash chips are actually doing. Not that it wasn't
a similar situation with HDD controllers before, but at least it didn't seem
as unpredictable.

~~~
spatular
If on Linux, try running fstrim from time to time. More free space makes the
life of the SSD's garbage collector / defragmenter much easier. I've
anecdotally noticed that running fstrim reduces freeze-ups under heavy load
from 1-2 s to almost nothing on my Toshiba drive.
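
For anyone wondering what "from time to time" can look like in practice: many distros ship a weekly fstrim.timer, or you can just call the util-linux fstrim tool yourself (needs root). A minimal sketch, assuming fstrim is installed:

```python
# Minimal sketch: trim all mounted filesystems that support discard, using the
# util-linux fstrim tool.  Requires root; roughly what the systemd fstrim.timer
# unit does on a schedule.
import subprocess

# --all: every supported mounted filesystem; --verbose: report bytes trimmed
result = subprocess.run(
    ["fstrim", "--all", "--verbose"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```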

~~~
shittyadmin
On Windows 10, I believe it's supposed to automatically TRIM free space on
NTFS.

------
docker_up
I worked at a storage company, and they reinforced to us that not only does
the OS lie to us, but the hard drives also lie to the OS. So you can't take
anything you get from a hard drive as reliable; you have to test the data once
you get it, e.g. through CRCs. Data can get corrupted at any time.

As densities of data get higher and higher, it doesn't take much to have a
catastrophic data failure. The only way to protect against this is having
multiple replicas of your data.

~~~
blattimwind
Application to OS: Did you write that data out to disk?

OS: Yes, I did.

OS' inner voice: Nah, I didn't, but I'll do it soon, I think.

OS to drive: Those write commands, they're durable yet?

Drive: Sure!

Drive's inner voice: Nah, they're still in my RAM, but I'll probably write
them out any time now.
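
The application's end of that conversation is at least partly under its control: write() only hands data to the OS page cache, and fsync() asks the OS to push it to the device and (on most filesystems) issue a cache-flush command to the drive. Whether the drive honors that flush is exactly the "inner voice" above. A small sketch:

```python
# Sketch: requesting durability from user space.  write() alone only reaches
# the OS page cache; fsync() asks the kernel to write the data out and send a
# flush to the drive.  Whether the drive's own cache actually obeys is the
# drive's problem (and yours).
import os

fd = os.open("important.dat", os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"data I would like to survive a power cut\n")
    os.fsync(fd)              # flush file contents to the device
finally:
    os.close(fd)

# For a newly created file, the directory entry needs its own fsync too,
# otherwise the file itself may not survive a crash.
dfd = os.open(".", os.O_RDONLY)
try:
    os.fsync(dfd)
finally:
    os.close(dfd)
```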

------
mshook
In a way we need someone/something like Backblaze doing a SMART report about
SSDs, to let us know which SMART metrics we should be monitoring...

Because they've shown most metrics are kinda useless or mean different things
from one manufacturer to another.

[https://www.backblaze.com/blog/what-smart-stats-indicate-har...](https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/)

[https://www.backblaze.com/blog/hard-drive-smart-stats/](https://www.backblaze.com/blog/hard-drive-smart-stats/)
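
Until someone publishes SSD-specific numbers, the practical approach is to pull per-attribute values with smartmontools and watch the handful Backblaze highlights for hard drives (SMART 5, 187, 188, 197, 198). A rough sketch, assuming a reasonably recent smartctl (7.x, for JSON output) and a drive at /dev/sda; attribute names and meanings still vary by vendor:

```python
# Rough sketch: read the SMART attributes Backblaze calls out (5, 187, 188,
# 197, 198) using smartmontools.  Assumes smartctl >= 7.0 (for --json) and
# that the drive of interest is /dev/sda.
import json
import subprocess

WATCHED = {5, 187, 188, 197, 198}

out = subprocess.run(
    ["smartctl", "--attributes", "--json", "/dev/sda"],
    capture_output=True, text=True, check=True,
).stdout
report = json.loads(out)

for attr in report.get("ata_smart_attributes", {}).get("table", []):
    if attr["id"] in WATCHED:
        print(f'{attr["id"]:>3} {attr["name"]:<24} raw={attr["raw"]["value"]}')
```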

------
pmden
"We had one SSD fail in this way and then come back when it was pulled out and
reinserted, apparently perfectly healthy, which doesn't inspire confidence."

We've experienced exactly the same thing. Our general course of action is to
perform a hard power cycle of the server through IPMI - a warm cycle doesn't
seem to work. I've always presumed it was down to dodgy SSD controller
firmware given the way it suddenly stops appearing in the output of fdisk -l.

~~~
phaedrus
I have three SSDs in three different laptops/desktops. In their current host
machines they've been working flawlessly for a couple of years. Prior to my
figuring out which SSD paired best with which host machine, I experienced
intermittent strange and catastrophic problems (unreadable sectors to complete
data loss) with each one. These were different brands, different capacities,
bought in different years.

It's sort of a devil's bargain - the performance of SSDs is so much better
that I can't pass up using them over a spinning disk even if they occasionally
lose everything. There was a great game for the original Nintendo called
"Pinball Quest". As you advanced through the game you could get upgrades such
as side stoppers, stronger flippers, etc. You bought these items from a demon
in between levels. After the red "Strong Flippers", the next upgrade was the
purple "Devil's Flippers". The trick was that occasionally they'd turn to
stone when you needed them and possibly cause you to lose the pinball. But
they were such an upgrade over the Strong Flippers (when they weren't turned
to stone) that you bought them anyway.

SSDs are kind of like that.

~~~
jug
It also doesn't help that at least Windows 10 is seemingly now "Optimized for
SSD" in the sense that performance is quite terrible on a traditional HDD. I
imagine this will become more and more common as seek times and hard drive
thrashing become practically invisible to users as well as developers. It
will just get harder to go back as time goes on.

------
CosmicShadow
I've had a few SSDs give me random issues, and they are so hard to pin down:
sometimes they just work, other times they abruptly stop, or just aren't
detected until like 3 reboots later and then work fine. After you've had
trouble, they make you feel like you are hanging on a hope that the ground
won't fall out from under you.

You also CAN'T HEAR if there is an issue. With hard drives, sound acts as
another warning sign that something might be going wrong or will go wrong
soon. Loud ticking or clicking, or the sound of overwork, is a sure sign to
start backing up and get ready to buy a new drive!

------
Severian
Call me crazy, but I don't think that a Crucial MX300 is the best choice for
an enterprise-worthy ZFS drive. I get what the author is concerned about, but
I wouldn't be that surprised that a consumer-level SSD failed in what sounds
like a heavily used fileserver.

~~~
pythonpatrol
Crucial? I can tell you most of them have SpecTek chips inside; no wonder
they fail.

~~~
gruez
What's wrong with SpecTek chips?

~~~
happycube
They're Micron's B-stock that didn't pass quality tests, i.e. they'll
probably die sooner.

------
lucb1e
I don't know why my hard drives died either. And while a physical motor
breaking is more tangible, a contact wearing out is also imaginable. I don't
really care why SSDs or HDDs die; I care that they do, and therefore I have
backups (well, ideally I would). I've had spinning rust fail on me while I was
sitting at it, and that didn't help me save it - it might as well have been
dead in zero seconds.

~~~
creeble
I don't really care why hard drives die either, but I like that, more often
than not, I get some warning. SMART logs, or weird kernel complaints, in my
experience, are frequent precursors.

I'm a little scared about my new SSDs that have replaced a few rust-spinners
in our data center.

~~~
soneil
That's the big difference for me. Drive's making tictictick sounds? Kernel log
full of bigScaryErrorsLikeThis? It's time to ditch that disk before the disk
ditches you. Make it happen. Panic-Backup if you need to. etc.

Every SSD failure I've had, the failure mode was "what SSD?"

Now, I realise most people should ponder their backup regime before the
tictictick, not after. But as the phrase goes "The best time to plant a tree
was 20 years ago. The second-best is now." The SSD equivalent is "The be-
nope, too late."

They're just terribly unforgiving, which doesn't fit with a culture that
values cure over prevention.

------
JohnFen
It may be irrational, but I remain very distrustful of SSDs, in part for
reasons like this. I use them occasionally as temporary storage, but I don't
use them for anything that would cause me a headache if the drive died without
warning. So far, my observation is that their lifespan is considerably shorter
than spinning platter drives, and spinning platter drives typically give
plenty of warning before actually dying.

Perhaps I'll grow more comfortable after another decade or so, when there is
enough real world experience to go by.

~~~
soneil
I'm just distrustful of drives. All storage is essentially cache, and should
be treated like it.

I'm of the opinion that hard drives don't actually function; they just
maintain the illusion while they wait for a more interesting moment to die.

~~~
anticensor
Why do their motors move, then? For making noise?

~~~
MrEldritch
For scratching the platter, obviously.

------
bluejay2387
Maybe I am being overly simplistic, but shouldn't it not matter?

Who in the modern age doesn't back up everything all the time? Don't we all
operate with the assumption that these things are going to blow at any time?
90%+ of my data is on cloud storage now anyway. When an SSD goes out, don't
you just chuck it in the drawer of old drives that you promise to take to the
disposal center this weekend (and never do) and then take a quick trip to your
local computer store for a new one?

This reminds me of something an IT support staffer told me a long time ago:
"The difference between an IT pro and a user is that to an IT pro, hard drives
are a consumable resource".

~~~
Kye
Replacing an SSD is not free, and in most cases it's not easy. Maybe an IT pro
can just roll down to the computer store for a new one, put it in their laptop
(for free!), and throw a $100+ drive in a drawer without even thinking about
warranty, but most people can't. A backup doesn't excuse excessive rates of
failure and weird glitches.

~~~
SomeHacker44
Replacing an SSD in a modern Apple laptop is literally impossible. You need to
replace the whole dang laptop (or motherboard, whatever they call it these
days), which is not something a user can do.

Thank goodness for Backblaze, Time Machine, Carbon Copy Cloner, Drobo and
Synology. Maybe I have gone overboard, but I have not lost any data in 12+
years.

------
rkagerer
I've done a bit of ad-hoc reliability testing with SSD's.

Some years ago I got a great deal on several Pacer disks and wrote a program
to write a pseudo-random sequence of data (using a known initial seed) across
the entire disk and read it back and compare. Part way through, the data
didn't match. No ECC errors, nothing raised by the filesystem, just mismatched
bits which came back in a manner which tried to "trick" me into thinking they
were good data. This happened on like 5 of the 8 disks. Needless to say I sent
those crappy SSD's back to the manufacturer (unfortunately only got a 2/3
refund) along with some harsh words for their engineers.
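
For the curious, a minimal sketch of that kind of seeded write-then-verify test (Python; point it at a scratch file, or - destructively - at a raw device, which is presumably what the original program did):

```python
# Minimal sketch of a seeded write-then-verify test: fill the target with a
# reproducible pseudo-random stream, then regenerate the same stream and
# compare.  TARGET is a placeholder - substituting a raw device will destroy
# whatever is on it.  A serious version would also bypass the page cache
# (O_DIRECT, or unmount/remount) so reads actually hit the media.
import random

TARGET = "/tmp/scratch.img"     # placeholder path
SEED = 0xC0FFEE
BLOCK = 1 << 20                 # 1 MiB blocks
BLOCKS = 64                     # keep it small for a demo

def blocks(seed):
    rng = random.Random(seed)
    while True:
        yield rng.getrandbits(8 * BLOCK).to_bytes(BLOCK, "little")

with open(TARGET, "wb") as f:
    gen = blocks(SEED)
    for _ in range(BLOCKS):
        f.write(next(gen))

bad = 0
with open(TARGET, "rb") as f:
    gen = blocks(SEED)
    for i in range(BLOCKS):
        if f.read(BLOCK) != next(gen):
            bad += 1
            print(f"mismatch in block {i}")

print("OK" if bad == 0 else f"{bad} bad blocks")
```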

I've had more name-brand SSDs fail, in various manners (even on well-reviewed
Kingston drives). Sometimes in ways where the drive can't be accessed at all,
other times (in the best cases) in a manner which doesn't allow writes but
still allows reads (albeit at a trickle of a data rate).

These days I use solely Intel-based, top-line SSDs, and some (very limited)
Samsungs. The choice isn't based on empirical data, but rather an impression
that their bar is a little higher (or more conservative) in terms of
reliability, and simply not wanting to deal with the apparent issues I've
encountered with other brands. The downtime lost from restoring /
reconstructing just isn't worth it to me. Maybe I'm paying twice as much as I
ought to, but since making the switch many years back it's worked out pretty
well and I've been happy / fortunate.

I run my SSD's in RAID10 using high-end controllers (aside from a few in ZFS).

Just my own subjective experiences, again I'm not doing this at scale.

------
loeg
I recently had a similar SSD failure, although it wasn't in a "new fileserver"
but my daily use 2013 desktop. It was working, then it was producing write
errors corrupting my filesystem, then the whole system died, very quickly.
Fortunately for me, some data was recoverable from the corrupted disk; I had a
local backup from 12h prior, and a tarsnap backup from about the same time
back.

(Um, here's where I have to be critical of tarsnap: their recovery performance
is absolutely abysmal for small files. They're latency bound between you,
their EC2 instance, and the backing S3 store. Think single or double digit
kB/s and then think about how much data you back up with tarsnap. I can't
recommend any other backup provider better, but this is an experience where
tarsnap left me very disappointed.)

Looking at that SSD's and my other SSDs' SMART data, they report the spare
blocks remaining, and you can monitor that as it goes down. Ideally you
replace the drive before it gets to zero.

My primary mistake was simply not monitoring that data in an effective way.

I don't think anyone who monitors HDDs has any real expectation that the high-
level SMART yes/no is going to protect them from data loss. Instead they look
at highly predictive factors like "Reallocated_Sector_Ct" or
"Raw_Read_Error_Rate" (or even plain old "Power_On_Hours").

For SSDs it's quite similar: Reallocated_Sector_Ct, Power_On_Hours_and_Msec,
Available_Reservd_Space, Uncorrectable_Error_Cnt, Erase_Fail_Count,
Workld_Media_Wear_Indic, Media_Wearout_Indicator. Maybe NAND_Writes_1GiB.

NVMe SSDs provide SMART-like data on log page 2 ("Available Spare",
"Percentage Used", "Power On Hours"). For some reason the NVMe spec does not
require devices to accept host-initiated self-tests, so most NVMe drives don't
have the same functionality as smartctl --test. :-(
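
The log page is at least easy to get at; a hedged sketch of pulling it with the nvme-cli tool, assuming nvme-cli is installed, the device is /dev/nvme0, and a version whose JSON output uses these key names:

```python
# Hedged sketch: read the NVMe SMART / Health log (log page 0x02) via nvme-cli.
# The device path and the JSON key names below are assumptions; they may
# differ by system and nvme-cli version.
import json
import subprocess

out = subprocess.run(
    ["nvme", "smart-log", "/dev/nvme0", "-o", "json"],
    capture_output=True, text=True, check=True,
).stdout
log = json.loads(out)

# The spec calls these "Available Spare", "Percentage Used", "Power On Hours".
for key in ("avail_spare", "percent_used", "power_on_hours"):
    print(key, log.get(key))
```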

------
MarkusWandel
For my home setup, at least, it's simple: put the OS on a dirt-cheap 120GB
SSD, and all the user data on a multi-terabyte hard disk. You can always
selectively migrate other performance-critical (but expendable) stuff onto the
SSD later. If it breaks, I just buy another one and reinstall the OS. On
laptops that can only take one drive, the SSD is it, but so is the awareness
that the data on it has to be considered ephemeral. I've had assorted hard
disks die over the years, from old age, and so far without exception they've
been "mostly" recoverable - I might have to give up on a few files that got
hit by bad sectors, that sort of thing. And I have been warned about impending
failure by SMART diagnostics.

~~~
MarkusWandel
I should quickly add that I also have a backup strategy! Still, a catastrophic
failure of a storage device is very inconvenient.

------
mirimir
My first experience with drive failure was a ~40MB HDD expansion card in a
386. The bearings got "sticky", so the spindle wouldn't start rotating. But
there was a hole covered with aluminum tape, and you could insert the eraser
end of a pencil and nudge it. So yes, very understandable.

Not too much later, I used Iomega ZIP drives, and experienced the "click of
death". That was sudden, and irreversible, but also very understandable.

For the past couple decades, I've consistently used RAID arrays, mostly RAID1
or RAID10 (and RAID0 or RAID5-6 for ephemeral stuff). I've had several HDD
failures, but they were usually progressive, and I just swapped out and
rebuilt.

I recently had my first SSD failure. And it was also progressive. The first
symptom was a system freeze, requiring a hard reboot, and then I'd see that
one of the SSDs had dropped out of the array. But I could add it back. At
first, I thought that there was some software problem, and that the RAID
dropout was just caused by the hard reboot.

But eventually, the box wouldn't boot, so I had to replace the bad SSD and
rebuild the array. It was complicated by having sd*1 RAID10 for /boot, and
sd*5 RAID10 for LVM2 and LUKS. So I also had to run fdisk before device mapper
would work.

------
linsomniac
Reading that blog and its sister post about "flaky SMART data" on those same
Crucial MX500 drives reminds me that not all SSDs are created equal.

Just like not all hard drives are created equal. My previous job involved a
decade running 10 cabinets of servers an hour away with very little manpower:
we eventually came to find that IBM/HGST drives were a lot more reliable than
others.

We also evaluated some early SSDs, and they were terribly unreliable. We
eventually settled on the Intel drives and they were superb. At my new job
we've been using mostly Intel and Samsung Pro drives, and they work great. But
Dell sent us a server with some "enterprise SSDs" in it, which we eventually
found were Plextor drives. Those things were terrible. We replaced them
immediately with Intel, but we used some of the Plextor drives and had all of
them fail within a year. I'd put the Intel 64GB SLC drives from our 7-year-old
database server in a system before I'd put one of those brand-new "enterprise"
Plextor drives in.

I love Crucial, I buy a lot of RAM from them, but I'm skeptical of switching
to other brands of SSDs. The more experience I have, the more conservative I
get with systems that matter.

------
tuzakey
I had a bunch of Crucial SSDs die a few years back; they'd work for an hour,
then disappear from the bus. Reboot and they'd work again for an hour. It
turned out Crucial had a small counter tracking uptime by the hour; it would
increment the counter until it overflowed and crashed. This failure could just
as easily have occurred on a spinning HDD.

~~~
saidajigumi
That this was a bug is utterly unsurprising. _How it passed device & firmware
QA_ is an utter bafflement to me, however.

------
dogben
Crucial is using low-grade NAND on some of their products:
[https://www.reddit.com/r/hardware/comments/a4uwag/spectek_fl...](https://www.reddit.com/r/hardware/comments/a4uwag/spectek_flash_without_logo_grade_marking_low/)

------
pinebox
I actually much prefer this SSD failure mode: unlike failing spinning rust,
which will happily linger around coughing up bad data (which will then be
written to backups, mirrored drives, etc., potentially creating a huge mess),
an SSD going out like a light is comfortingly binary.

~~~
starbeast
It's the thorny problem of elegant vs. graceful degradation. In a RAID system
you want something elegant but not synchronized. In a single drive, some sort
of graceful degradation is usually preferable.

[http://www.assetinsights.net/Glossary/G_Elegant_Degradation....](http://www.assetinsights.net/Glossary/G_Elegant_Degradation.html)

[http://www.assetinsights.net/Glossary/G_Graceful_Degradation...](http://www.assetinsights.net/Glossary/G_Graceful_Degradation.html)

------
dooglius
Relevant: there is a project called LightNVM [0] which is pushing for a much
lower-level API to SSDs, one that allows most of the complexity to be moved
into the host OS (namely, Linux).

[0] [http://lightnvm.io](http://lightnvm.io)

------
__x0x__
To add to the anecdata: My most recent SSD failure happened _when I did the
firmware upgrade_. It worked before the upgrade, the upgrade binary said
'upgrade failed' and the disk vanished and never returned after the 'upgrade'.

------
XorNot
This post, more than any other, just convinced me to pull out my old Unison
file-sync configuration (which was really good, looking at it) and get regular
syncs to my NAS (which in turn uploads to cloud storage) working properly
again.

------
Shivetya
Having recently swapped 100 TB of spinning media to SSD, I am awaiting the
first failures. Being a business environment, it is all mirrored capacity. So
I guess my question from the article is: are they running on a single device?
No RAID or mirror?

I am loath to keep even my personal data at home on one drive, and since I use
an iMac, that requires me to have Time Machine, as mirroring etc. of the
internal drive is not truly possible; at least, I did not spend enough time
researching it.

------
massafaka
I think those drives dying quickly is actually a Good Thing™, because the
chances that you're backing up corrupt data might become smaller…

With the older drives you would sometimes have a drive die, replace it, and
restore your backup, only to find that in the process of dying the drive was
actually corrupting some of the data which went into the backups; now you've
got to hunt down the last uncorrupted versions of the data in the backup…

------
bitL
Did you try to bake it in the oven ("reflow")? Sometimes you can add a few
more hours to its life, enough for backing it up.

~~~
LeftTurnSignal
I've had moderate success with sticking failing HDs in the freezer overnight
in hopes of getting them to spin one last time.

Never tried baking a drive (SSD or HD), but I have with a red-ringed Xbox 360
mobo.

Sometimes the "low tech" solutions still work.

~~~
yeukhon
Curious what is the rationale?

~~~
LeftTurnSignal
I don't know the technical side well enough to give any real answer, but
tossing them in the freezer (properly sealed, of course) always seemed to help
"loosen" them up enough to get data off.

I've also had some hard drives that you could bring back to life by giving
them a firm knock with your knuckles.

Doesn't really answer much, but it's a last-ditch effort that has saved me
more times than not.

------
lordnacho
This seems like the very human problem of trying to grapple with
probability.

We have all sorts of knowledge about it, but when something happens we're
still looking for an explanation for each instance.

If you think of it like nuclear decay you'll still be able to say things about
the ensemble, but not each individual member.

------
coreyoconnor
"When a HD died early, you could also imagine undetected manufacturing flaws
that finally gave way. With SSDs, at least in theory that shouldn't happen"

This is incorrect. As much of the argument seems predicated on this, I don't
see a real issue.

------
Rafuino
Keep getting a 403 Forbidden error. Anyone have an archive link they can send
my way?

------
n-gatedotcom
Two questions: how do major cloud providers (Azure, AWS, Heroku) handle
storage failures?

What are some best practices for personal hard-drive-crash early warning?

~~~
Kye
Redundancy, replication (being able to recreate one failed drive from a
certain number of other drives), reliability data, and a replacement budget.
That's difficult for personal use.
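
The "recreate one failed drive from the others" part is less magic than it sounds; the simplest scheme is single-parity XOR, as in RAID 4/5. A toy illustration (not how any real array is implemented, but the arithmetic is the same):

```python
# Toy single-parity (XOR) reconstruction, the idea behind RAID 4/5: the parity
# block is the XOR of all data blocks, so any one missing block is the XOR of
# the survivors plus parity.
import functools

def xor_blocks(blocks):
    return bytes(functools.reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data_drives = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_blocks(data_drives)

# "Drive 1" dies; rebuild its contents from the surviving drives plus parity.
rebuilt = xor_blocks([data_drives[0], data_drives[2], parity])
assert rebuilt == data_drives[1]
```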

------
HelloNurse
TL;DR: The lack of noise makes SSDs bad at motivating users to do backups or
use redundant storage: they never seem to be on the verge of catastrophic
failure.

~~~
cm2187
I think operating systems should be programmed to completely wipe out a drive
once, early in the life of every user (around age 20-ish), to burn into their
brain the need to back up!

~~~
SomeHacker44
My father had the same concept in getting into a car crash early in my driving
life. "Now that you have gotten that out of your system (and I'm glad you're
fine), don't ever do it again."

~~~
cm2187
Actually that did happen to me early. Luckily enough no harm done. And a
lesson for the rest of my life.

------
bepvte
Anyone have a good guide on buying reputable SSDs?

