
SSD reliability in the real world: Google's experience - ValentineC
http://www.zdnet.com/article/ssd-reliability-in-the-real-world-googles-experience/
======
cpg
I just love how Google datacenters are somehow "the real world". Nice, cool
and controlled temperatures, batch ordering from vendors who know they are
shipping to Google, stable/repeatable environments, not much mention of the
I/O load, etc.

And then there's

> Ignore Uncorrectable Bit Error Rate (UBER) specs. A meaningless number. ...

> Bad news: SSDs fail at a lower rate than disks, but UBER rate is higher

meaningless, it is.

The real world has wild temperature ranges, wilder temperature _changes_,
mechanical variations well above and beyond datacenter use, and possibly
wildly varying loads (e.g. viruses, antiviruses, all sorts of updates, etc. etc.).

It is easier to do stats this way, though.

~~~
hobs
Google is quite famous for running their data centers hotter than others.

* Title makes it obvious, but 95°F: [http://www.geek.com/chips/googles-most-efficient-data-center...](http://www.geek.com/chips/googles-most-efficient-data-center-runs-at-95-degrees-1478473/)

* An increase in DNS request failures (likely due to said heat) and bad routing caused internal Google services to request this guy's stuff: [https://www.youtube.com/watch?v=aT7mnSstKGs](https://www.youtube.com/watch?v=aT7mnSstKGs)

~~~
Symbiote
To save others the bother: 95°F = 35°C. Warm!

~~~
ehnto
For the sake of anecdote, during an Australian summer my drives can get to
50°C, and will usually be around 30°C to 40°C for the rest of the year.

For fun, my GPU will hover around 50-70°C, occasionally hitting 80°C. The CPU
sits around 50°C, and the rest of the machine is a mystery!

~~~
jon-wood
Thanks to some utterly awful cooling, my laptop CPU idles at about 70°C, peaking
in the 80s. I don't know if it's within spec for an i7 or just dumb luck, but
it's still running just fine.

------
amatic
I had an SSD completely fail after only a month of use. It was a cheap-ish
KingSpec C3000 128GB. It wasn't recognized in BIOS. Surprisingly, doing
something called 'power cycling' made the disk work again. It still works
after 2 years of use.

[http://forum.crucial.com/t5/Crucial-SSDs/Why-did-my-SSD-quot...](http://forum.crucial.com/t5/Crucial-SSDs/Why-did-my-SSD-quot-disappear-quot-from-my-system/ta-p/65215)

> "A sudden power loss is a common cause for a system to fail to recognize an
SSD. In most cases, your SSD can be returned to normal operating condition by
completing a power cycle, a process that will take approximately one hour.

We recommend you perform this procedure on a desktop computer because it
allows you to only connect the SATA power connection, which improves the odds
of the power cycle being successful. However, a USB enclosure with an external
power source will also work. Apple and Windows desktop users follow the same
steps.

1. Once you have the drive connected and sitting idle, simply power on the
computer and wait for 20 minutes. We recommend that you don't use the computer
during this process.

2. Power the computer down and disconnect the drive from the power connector
for 30 seconds.

3. Reconnect the drive, and repeat steps 1 and 2 one more time.

4. Reconnect the drive normally, and boot the computer to your operating
system."

~~~
dsr_
A $15 USB3-SATA external interface will do this just as well, and you don't
even have to plug it into a computer. As a bonus, it's a good thing to have
lying around a workshop where you might want to temporarily plug in a drive to
investigate it.

(Edit: wow, somehow I missed the line you had about that in your post. My
apologies.)

~~~
tropin
Could you recommend any? At our workshop we use Startech's but I recall them
being way costlier than 15 bucks.

~~~
dsr_
NewEgg has a bunch of no-names for $7-12 which appear to be identical to the
ones in my lab, which also have no identifying marks: a separate AC-DC power
supply brick going to a switch and a 4-pin Molex, a separate Molex-to-SATA
power converter, and a small black box that takes USB on one side and has a
SATA port and 3.5/2.5 PATA on the edges.

------
simoncion
The ZDNet article doesn't say much more than the paper's abstract does.

Perhaps the link should be changed to either [https://www.usenix.org/conference/fast16/technical-sessions/...](https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder), or to the paper itself: [http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f...](http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/23105-fast16-papers-schroeder.pdf) (linked from the first link).

------
joenathan
This article is light on details; looking at the source, it would be great
to know brands and model numbers.

I use SSDs in all my builds, servers and workstations. My most used are
Samsung 850 EVOs and PROs, followed by the Intel 750 and Samsung 950 PRO.

Out of about 80 or so that I've put into production over the past few years
I've had about 3 850 EVOs go bad on me, just completely lock up the machine
they are connected to, can't even run diag. I make sure to use the PRO series
in critical environments, and EVO for budget.

~~~
frik
> I've had about 3 850 EVOs go bad on me, just completely lock up the machine
> they are connected to, can't even run diag

That's worrying. It should at least remain a proper PCI-E citizen and not lock
up the computer. It should also always provide the SMART data, even if the SSD
is otherwise dead. And even if writing is no longer safely possible (due to
many bad bits), it should lock down to read-only.

~~~
edwintorok
I was using an OCZ Vertex 2 SSD in the past, when (possibly due to a firmware
bug) it just wouldn't appear on the SATA bus anymore and had an error LED lit
up. I RMA'd the drive, and they said they had to reflash the firmware, but
doing so would also lose all data because it would erase the AES key. (I
never configured any encryption on the drive, but apparently it uses it by
default.)

Needless to say I didn't buy OCZ again, but I'm not sure if this is a general
problem with SSDs or just Sandforce controllers.

~~~
PlaneSploit
I think this was just an OCZ thing.

~~~
josteink
Where I work we have had more OCZ drives fail on us than not.

I wouldn't trust them even with temporary data, no matter what performance
they can push.

------
danso
FWIW Facebook published its own research on flash memory failure: [https://research.facebook.com/publications/a-large-scale-stu...](https://research.facebook.com/publications/a-large-scale-study-of-flash-memory-failures-in-the-field/)

------
mvgoogler
It's fun to see this data come out. I did some of the early reliability tests
for these devices. At the time, Google hadn't publicly announced that we were
using SSDs in the data-center so there wasn't really an opportunity to publish
or talk about it.

The most surprising thing to me was the correlation between age and RBER.

I would not have guessed that normal aging would play a significant role in
reliability. It would be fun to understand what is happening there.

~~~
keithpeter
Age means chronological age?

So if I set up an SSD then take it out of the machine and put it on the shelf
the data loss would be similar to one that I left in the machine?

Just checking.

If so, it's got to be cosmic rays or some kind of chemical change in the chips
(anyone else remember tin whiskers growing on early RF transistors?)

~~~
zaroth
In fact, SSDs are not designed to retain data over very long periods while
powered down... _particularly_ when exposed to higher temperatures, and if
they have seen lots of writes.

The interesting note from this abstract is that, I guess, they saw that the
number of writes didn't seem to have as much of an effect. I don't know if
they tested temperature as an independent variable.

I suspect higher temps played a role in influencing Google's results. Higher
temps while powered on will increase write endurance. If they were ever
powering off drives for efficiency purposes, this would also have an effect on
read errors, which also get worse with higher temperature.

I'll quote heavily from AnandTech's article on the topic:

"As always, there is a technical explanation to the data retention scaling.
The conductivity of a semiconductor scales with temperature, which is bad news
for NAND because when it's unpowered the electrons are not supposed to move as
that would change the charge of the cell. In other words, as the temperature
increases, the electrons escape the floating gate faster that ultimately
changes the voltage state of the cell and renders data unreadable (i.e. the
drive no longer retains data).

For active use the temperature has the opposite effect. Because higher
temperature makes the silicon more conductive, the flow of current is higher
during program/erase operation and causes less stress on the tunnel oxide,
improving the endurance of the cell because endurance is practically limited
by tunnel oxide's ability to hold the electrons inside the floating gate."

See: [http://www.anandtech.com/show/9248/the-truth-about-ssd-data-...](http://www.anandtech.com/show/9248/the-truth-about-ssd-data-retention)

~~~
keithpeter
Charge leaking away over time - arrays of capacitors - got it. Thanks.

------
MrBuddyCasino
"High-end SLC drives are no more reliable that MLC drives."

"None of the drives in the study came anywhere near their write limits"

I thought a major difference between MLC and SLC is the number of write
cycles?

~~~
edvinbesic
I think you're right, but what they are saying is that it's irrelevant
because the drives/blocks are failing due to age, not the number of write
cycles.

------
dandrews
Something I was unaware of: single bit (correctable) read errors _are not_
immediately repaired in NAND. Subsequent in-error reads are repeatedly
processed by ECC on the fly, while the media rewrite is scheduled for sometime
later.

This inflates the observed correctable error rate to some extent.
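
A toy sketch of how that inflates the count (my own illustration with hypothetical numbers, not actual firmware behavior):

    # Toy sketch (not actual firmware behavior): a page with a correctable
    # bit error is fixed by ECC on every read, but the page isn't rewritten
    # immediately, so each read of it is counted as another correctable
    # error until the deferred rewrite happens.
    reads_before_rewrite = 10          # hypothetical: rewrite scheduled after 10 reads
    observed_correctable_errors = 0
    page_has_bit_error = True

    for read in range(25):
        if page_has_bit_error:
            observed_correctable_errors += 1   # ECC corrects it on the fly, again
        if read + 1 == reads_before_rewrite:
            page_has_bit_error = False         # deferred media rewrite clears it

    print(observed_correctable_errors)   # 10 observed errors from 1 underlying fault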

~~~
mvgoogler
FWIW - single-bit errors are not synonymous with correctable errors in flash.

For any given type of flash chip the manufacturer will provide a spec like
"you must be able to correct N bits per M bytes". The firmware for a flash
drive must use forward error correction codes (e.g. BCH or LDPC) of sufficient
strength to correct the specified number of bit errors.

Dealing with a certain amount of bit-errors is just part of dealing with
flash.

For example, a chip could have a spec that you must be able to correct up to 8
bit errors per 512 bytes (made up numbers). If the chip had 4KiB pages, each
page would be split into 8 "chunks" that were each protected by error
correcting codes that were capable of correcting up to 8 single-bit errors in
that chunk. As long as no "chunk" had more than 8 bit errors, the read would
succeed.

So in this case you could theoretically have a page read with 64 bit errors
that succeeded.
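
To make the chunking concrete, here's a minimal sketch in Python (my own illustration, using the made-up numbers above, not any real controller's firmware):

    # Minimal sketch of the per-chunk ECC idea described above. Numbers match
    # the made-up spec: a 4 KiB page split into 8 chunks of 512 bytes, each
    # correctable up to 8 bit errors.
    CHUNKS_PER_PAGE = 8
    CORRECTABLE_BITS_PER_CHUNK = 8

    def page_read_succeeds(bit_errors_per_chunk):
        """Return True if every chunk's error count is within the ECC strength."""
        assert len(bit_errors_per_chunk) == CHUNKS_PER_PAGE
        return all(e <= CORRECTABLE_BITS_PER_CHUNK for e in bit_errors_per_chunk)

    # A page with 64 total bit errors can still read back correctly,
    # as long as no single chunk exceeds 8 errors:
    print(page_read_succeeds([8] * 8))                    # True  (64 errors total)
    # ...while a page with only 9 errors fails if they all land in one chunk:
    print(page_read_succeeds([9, 0, 0, 0, 0, 0, 0, 0]))   # False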

This is alluded to in the paper: "The first generation of drives report
accurate counts for the number of bits read, but for each page, consisting of
16 data chunks, only report the number of corrupted bits in the data chunk
that had the largest number of corrupted bits."

~~~
dandrews
Thanks for your gentle correction. (I was thinking of Gray code, as taught to
me during my Wonder Years. You're right as rain of course.)

------
vacri
The first key conclusion threw me a bit. Easy to misread as "Ignore UBER"
rather than "Ignore UBER _specs_ ".

~~~
mvgoogler
The article talks about ignoring UBER _specs_; the paper doesn't say that (at
least that's not how I read it).

From the paper:

    
    
        We find that UBER (uncorrectable bit error rate), the
        standard metric to measure uncorrectable errors, is not
        very meaningful. We see no correlation between UEs
        and number of reads, so normalizing uncorrectable errors
        by the number of bits read will artificially inflate the
        reported error rate for drives with low read count.
    

Basically, uncorrectable errors are not correlated with the number of reads, so
computing an UBER by dividing the number of uncorrectable errors by the number
of bits read is meaningless. Therefore UBER is a meaningless metric.
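
As a rough illustration (made-up numbers, not from the paper), the same UE count produces a wildly different UBER depending on how much the drive happened to read:

    # Rough sketch with hypothetical numbers: if uncorrectable errors (UEs)
    # don't scale with reads, dividing by bits read inflates UBER for drives
    # that happen to read little.
    def uber(uncorrectable_errors, bits_read):
        return uncorrectable_errors / bits_read

    ues = 2                      # same UE count on both drives
    busy_drive_bits = 10**15     # heavily read drive
    idle_drive_bits = 10**12     # lightly read drive

    print(uber(ues, busy_drive_bits))   # 2e-15
    print(uber(ues, idle_drive_bits))   # 2e-12 -- looks 1000x "worse"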

They don't differentiate between UBER as a manufacturer spec and UBER as
measured in production.

------
mtdewcmu
I was expecting to read about drives being bricked by buggy firmware. I guess
they finally solved the firmware side.

~~~
thrownaway2424
According to the paper, these are full-custom drives with custom firmware.
These are not commercial SSDs.

~~~
coldtea
Where did you get that "full-custom" thing?

The FA says they tested: "10 different drive models" of "enterprise and
consumer drives".

~~~
tangentspace
The original paper states the following:

> The drives in our study are custom designed high performance solid state
> drives, which are based on commodity flash chips, but use a custom PCIe
> interface, firmware and driver.

------
drallison
Bianca Schroeder speaking on SSD reliability in the Stanford EE Computer
Systems Colloquium:
[https://youtu.be/60OmhRJ0CUA](https://youtu.be/60OmhRJ0CUA). The abstract for
her talk can be found at
[http://ee380.stanford.edu/Abstracts/160224.html](http://ee380.stanford.edu/Abstracts/160224.html).

------
ntaylor
> _30-80 percent of SSDs develop at least one bad block and 2-7 percent
> develop at least one bad chip in the first four years of deployment._

Did he perhaps mean 3-8 percent?

~~~
pkaye
No, almost every NAND chip will have bad blocks. The initial ones are
discarded during drive initialization at the factory. Then they ideally do
some burn-in, which may catch a few more bad blocks. Then over the life of the
drive you may see thousands of defective blocks, depending on capacity and the
number of NAND dies. However, the drive can be designed and tested to tolerate
an expected defect limit with no data loss. If there is a bad chip, that
can also be handled, but the recovery process can put high stress on the SSD,
much like a RAID recovery.
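
A toy illustration of the spare-block idea (my own sketch, not how any particular firmware implements it): blocks that go bad are remapped to a reserved pool, and there's no data loss until that pool runs out.

    # Toy illustration (not any real firmware): an SSD ships with a pool of
    # spare blocks; blocks that go bad during the drive's life are remapped
    # to spares, with no data loss until the spare pool is exhausted.
    class BlockRemapper:
        def __init__(self, spare_blocks):
            self.spares = list(spare_blocks)   # reserved good blocks
            self.remap = {}                    # bad block -> spare block

        def mark_bad(self, block):
            if not self.spares:
                raise RuntimeError("spare pool exhausted; drive degraded")
            self.remap[block] = self.spares.pop()

        def resolve(self, block):
            # Reads/writes are transparently redirected to the spare.
            return self.remap.get(block, block)

    r = BlockRemapper(spare_blocks=[1000, 1001, 1002])
    r.mark_bad(42)
    print(r.resolve(42))   # 1002 -- redirected to a spare
    print(r.resolve(7))    # 7    -- healthy block, unchanged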

------
dheera
More numbers with specific models would be nice. It would perhaps be useful to
do a cost-benefit analysis over time of better SSDs vs. cheap SSDs with more
redundancy and backups.

------
smegel
> drive days

What is a drive day?

~~~
biot
Much like a man-hour is a measure of one human working for one hour, a drive
day is one drive operating for one day. So a million drive days could be a
million drives for one day, a thousand drives for a thousand days, or any
combination.
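
A trivial sketch of the arithmetic (made-up fleet sizes):

    # Trivial arithmetic with made-up numbers: drive days = drives x days observed.
    def drive_days(num_drives, days):
        return num_drives * days

    print(drive_days(1_000_000, 1))   # 1000000 drive days
    print(drive_days(1_000, 1_000))   # also 1000000 drive days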

------
late2part
I'd love to hear about folks' experience with the Samsung 850 Pro...

~~~
maaku
Power failures result in corruption. (To be fair, true of just about every SSD
except the Intel DC line. But I have personally experienced this with the 850
Pro.)

~~~
macross
Some higher-end (e.g. enterprise) SSDs have some form of power backup (e.g.
capacitors) to protect against power-failure-induced corruption. SSDs with
power backup tend to be more expensive than consumer-grade devices (such as
the 850 Pro), though.

~~~
maaku
The Intel DC (= datacenter) line which I mentioned in my comment has this:

[http://www.intel.com/content/www/us/en/solid-state-drives/da...](http://www.intel.com/content/www/us/en/solid-state-drives/data-center-family.html)

I'm sure there are others, but I don't know about them. Having surveyed the
market myself recently, there's no consumer-grade SSD with such a feature (but
the Intel ones aren't too pricey).

------
ape4
Ah, UBER is a meaningless number (not a taxi company)

