Hacker News
SSD reliability in the real world: Google's experience (zdnet.com)
264 points by ValentineC on Feb 27, 2016 | 69 comments



I just love how Google datacenters are somehow "the real world". Nice, cool and controlled temperatures, batch ordering from vendors knowing they are shipping to Google, stable/repeatable environments, not much mention about the I/O load, etc.

And then there's

> Ignore Uncorrectable Bit Error Rate (UBER) specs. A meaningless number. ...

> Bad news: SSDs fail at a lower rate than disks, but UBER rate is higher

meaningless, it is.

The real world has wild temperature ranges, wilder temperature _changes_, mechanical variations well above and beyond datacenter use, and possibly wild loads (e.g. viruses, antiviruses, all sorts of updates, etc. etc.).

It is easier to do stats this way, though.


>I just love how Google datacenters are somehow "the real world"

They totally are though, especially to the HN crowd where a lot of us may be putting hardware in data centers.

I agree that this isn't going to give us a clear picture of what to expect out of an SSD in, say, a netbook or something. On the other hand, data from a million SSDs reported by one company in a controlled environment is a hell of a control group if you want to go test factors like temperature, etc.


Google is quite famous for running their data centers hotter than others.

* Title makes it obvious, but 95°F: http://www.geek.com/chips/googles-most-efficient-data-center...

* An increase in DNS request failures (likely due to said heat) and bad routing caused internal iGoogle services to request this guy's stuff: https://www.youtube.com/watch?v=aT7mnSstKGs


To save others the bother: 95°F = 35°C. Warm!
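For anyone who wants the arithmetic spelled out, here is a minimal Python sketch of the conversion. The function name is my own, purely for illustration.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9


print(fahrenheit_to_celsius(95))  # 35.0
```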


For the sake of anecdote, during an Australian summer my drives can get to 50c, and will usually be around 30c to 40c for the rest of the year.

For fun, my GPU will hover around 50-70c, occasionally hitting 80c. The CPU around 50c, and the rest of the machine is a mystery!


Thanks to some utterly awful cooling my laptop CPU idles at about 70c, peaking in the 80s. I don't know if it's within spec for an i7 or just dumb luck, but it's still running just fine.


> https://www.youtube.com/watch?v=aT7mnSstKGs

Interesting talk.

I did find that <takes a drink> later in his talk he kept <takes a drink> taking a drink every 10 seconds or so <takes a drink>, which ended up being more than a <takes a drink> little <takes a drink> irritating to watch and listen <takes a drink> to.


> I just love how Google datacenters are somehow "the real world". ... It is easier to do stats this way, though.

From the paper's abstract (you did read the abstract, right?) :

"... While there is a large body of work based on experiments with individual flash chips in a controlled lab environment under synthetic workloads, there is a dearth of information on their behavior in the field. This paper provides a large-scale field study covering many millions of drive days, ten different drive models, different flash technologies (MLC, eMLC, SLC) over 6 years of production use in Google’s data centers. We study a wide range of reliability characteristics and come to a number of unexpected conclusions. For example, raw bit error rates (RBER) grow at a much slower rate with wear-out than the exponential rate commonly assumed and, more importantly, they are not predictive of uncorrectable errors or other error modes. The widely used metric UBER (uncorrectable bit error rate) is not a meaningful metric, since we see no correlation between the number of reads and the number of uncorrectable errors. We see no evidence that higher-end SLC drives are more reliable than MLC drives within typical drive lifetimes. Comparing with traditional hard disk drives, flash drives have a significantly lower replacement rate in the field, however, they have a higher rate of uncorrectable errors." [0][1]

I guess it's easier to draw incorrect, snarky conclusions based on inaccurate summaries of papers than it is to take a moment to read a paper's abstract to double-check the work of a tech journalist. shrug :(

[0] https://www.usenix.org/conference/fast16/technical-sessions/...

[1] http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f...


> > Ignore Uncorrectable Bit Error Rate (UBER) specs. A meaningless number. ...

> > Bad news: SSDs fail at a lower rate than disks, but UBER rate is higher

> meaningless, it is.

The first sentence is referring to "specs". It's the manufacturer's claims that are useless.

The second point was referring to the actual measured error rates, which are obviously more meaningful.


How are datacenters not "the real world"? They're the largest users of data storage devices!


According to this http://www.kitguru.net/components/hard-drives/anton-shilov/s... 125 million HDDs shipped in Q1 2015.

Does anyone have any idea how many went into desktops and laptops vs. data centers?


Largest single users of drives, sure. When they have tens of millions of them, I'd trust their reliability studies a lot more than Aunt Mike who has maybe two or three.


> cool and controlled temperatures

Controlled, but likely not cool:

http://www.geek.com/chips/googles-most-efficient-data-center...


Google's data centers are definitely not cool (in the temperature sense). That was one of the big reveals of the original disk paper: since Google runs things hotter, the drives experienced a much hotter environment, but their bit error rates were not hugely affected.


>The real world has wild temperature ranges, wilder temperature _changes_, mechanical variations well above and beyond datacenter use, and possibly wild loads (e.g. viruses, antiviruses, all sorts of updates, etc. etc.).

I'm not sure what you consider "the real world" (SSDs in ToughBooks for research expeditions in the Amazon?), but seeing that HN is a startup/pro IT social site, most of us are interested in running them in data centers...


I see your points but the reality is there is no real data on how an SSD acts and fails outside of torture tests from the manufacturer or review sites.

Data centre data is still good data to help us better understand SSD lifetime and failures.


I think it's very informative, and over time we'll see if this is representative of the normal world. Maybe temperature changes are not that important, maybe they are.

At the same time, they have Chrome OS. Most of those laptops have an SSD. All are connected to the cloud. How do they perform? When they have a problem, is it recorded by Google? I'm not 100% clear if I would want that, but it doesn't seem such a bad idea for a computer that is already 100% in the cloud.


The Chrome OS number would be less useful. A hard drive failure that totally destroys the ability to report the result back to Google is indistinguishable from a device that simply never turned on again, which over the course of years, I'd expect to dominate major drive failures.


In terms of the things that are going to make SSDs — or pretty much any other piece of hardware you can imagine — fall over, Google's environment is realer than anything you could possibly conceive.


I had an SSD completely fail after only a month of use. It was a cheap-ish KingSpec C3000 128GB. It wasn't recognized in BIOS. Surprisingly, doing something called 'power cycling' made the disk work again. It still works after 2 years of use.

http://forum.crucial.com/t5/Crucial-SSDs/Why-did-my-SSD-quot...

> "A sudden power loss is a common cause for a system to fail to recognize an SSD. In most cases, your SSD can be returned to normal operating condition by completing a power cycle, a process that will take approximately one hour.

We recommend you perform this procedure on a desktop computer because it allows you to only connect the SATA power connection, which improves the odds of the power cycle being successful. However, a USB enclosure with an external power source will also work. Apple and Windows desktop users follow the same steps.

1. Once you have the drive connected and sitting idle, simply power on the computer and wait for 20 minutes. We recommend that you don't use the computer during this process.

2. Power the computer down and disconnect the drive from the power connector for 30 seconds.

3. Reconnect the drive, and repeat steps 1 and 2 one more time.

4. Reconnect the drive normally, and boot the computer to your operating system. "


A $15 USB3-SATA external interface will do this just as well, and you don't even have to plug it into a computer. As a bonus, it's a good thing to have lying around a workshop where you might want to temporarily plug in a drive to investigate it.

(Edit: wow, somehow I missed the line you had about that in your post. My apologies.)


Could you recommend any? At our workshop we use Startech's but I recall them being way costlier than 15 bucks.


NewEgg has a bunch of nonames for $7-12 which appear to be identical to the ones in my lab, which also have no identifying marks. Separate AC-DC power supply brick going to a switch and a 4-pin Molex, separate Molex-SATA power converter, and a small black box that takes USB on one side and has a SATA port and 3.5/2.5 PATA on the edges.


While some of those Crucial drives are Micron and pretty reasonable, please beware of comparing "consumer drives" and drives from 2012 / 2013 with today's drives.

The real problem with drives is that their firmware has to garbage collect. So, yes, you can push a drive into a corner from which it can't escape. Not to name names, but I've had this happen in my company's testing of different drives. That also means there are peculiar results, such as a drive needing a restart, or needing to sit for an hour, after which it seems much better.

Our experience with MLC (managing and observing many customers' flash deployments) has been very positive, and it works very well when using major manufacturers' drives.


The ZDNet article doesn't say much more than the paper's abstract does.

Perhaps the link should be changed to either https://www.usenix.org/conference/fast16/technical-sessions/... , or to the paper itself: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f... (linked from the first link).


This article is light on details, and looking at the source it would be great to know brands and model numbers.

I use SSDs in all my builds, servers and workstations. My most used are Samsung 850 EVOs and PROs, followed by the Intel 750 and Samsung 950 PRO.

Out of about 80 or so that I've put into production over the past few years, I've had about 3 850 EVOs go bad on me: they just completely lock up the machine they are connected to, and I can't even run diag. I make sure to use the PRO series in critical environments, and EVO for budget.


The actual paper has much more information than the fluffy article that is linked here.

From the paper: "The drives in our study are custom designed high performance solid state drives, which are based on commodity flash chips, but use a custom PCIe interface, firmware and driver."

These aren't drives that you can just go out and buy, so brands and model numbers would be meaningless to anyone outside of Google.

Link to the paper: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f...


> I've had about 3 850 EVOs go bad on me, just completely lock up the machine they are connected to, can't even run diag

That's worrying. It should at least stay a proper PCI-E citizen, not lock up the computer. And it should always provide the SMART data, even if the SSD is dead otherwise. And even if writing isn't possible anymore (safely, due to many bad bits), it should lock-down to read-only.


I was using an OCZ Vertex 2 SSD in the past when (possibly due to a firmware bug) it just stopped appearing on the SATA bus and had an error LED lit up. I RMA'd the drive, and they said they had to reflash the firmware, but doing so would also lose all data because it would erase the AES key. (I never configured any encryption on the drive, but apparently it uses it by default.)

Needless to say I didn't buy OCZ again, but I'm not sure if this is a general problem with SSDs or just Sandforce controllers.


I think this was just an OCZ thing.


Where I work we have had more OCZ drives fail on us than not.

I wouldn't trust them even with temporary data, no matter what performance they can push.


I think the EVO 850 are all SATA, not PCIe. But in general, yes, they should at least appear on the bus...

Interestingly, when this happened to me with some "ADATA" SSDs, they would still negotiate a link speed, so their PHYs did get initialized. But Linux didn't get any further information from the disks: device type, name, capacity... So maybe their firmware crashed halfway through initializing the SSD.


How are the 950 Pros? Are they worth the hype? There are some rumors about them running too hot for laptop use, actually.


I only have a few in production. I'm using a 512GB 950 in my gaming PC (it has plenty of cooling, so I can't comment on the heat; the Intel 750s have a heatsink, though...). It's replacing an 850, and in practice it's hard to notice the difference between them. In a virtual environment where you have multiple VMs running on the same SSD, it has noticeable gains: higher IOPS and transfer speed make a big difference.

Most things I encounter in personal use are either CPU or GPU bound. There is nothing worse than having an application crawl to a halt when it is not even using a fraction of your system resources. Single threaded and 32 bit only applications are the bane of my existence.


It's unlikely Google is going to release brands and model numbers; they considered the information "proprietary" in their previous hard drive studies.


With their scale they might be ordering custom SSDs.

Edit: yep, right in the paper: "The drives in our study are custom designed high performance solid state drives, which are based on commodity flash chips, but use a custom PCIe interface, firmware and driver."


FWIW Facebook published its own research on flash memory failure https://research.facebook.com/publications/a-large-scale-stu...


It's fun to see this data come out. I did some of the early reliability tests for these devices. At the time, Google hadn't publicly announced that we were using SSDs in the data-center so there wasn't really an opportunity to publish or talk about it.

The most surprising thing to me was the correlation between age and RBER.

I would not have guessed that normal aging would play a significant role in reliability. It would be fun to understand what is happening there.


Age means chronological age?

So if I set up an SSD then take it out of the machine and put it on the shelf the data loss would be similar to one that I left in the machine?

Just checking.

If so, it's got to be cosmic rays or some kind of chemical change in the chips (anyone else remember tin whiskers growing on early RF transistors?)


In fact, SSDs are not designed to retain data over very long periods while powered down... particularly when exposed to higher temperatures, and if they have seen lots of writes.

The interesting note from this abstract is, I guess, that the number of writes didn't seem to have as much of an effect. I don't know if they tested temperature as an independent variable.

I suspect higher temps played a role in influencing Google's results. Higher temps while powered on will increase write endurance. If they were ever powering off drives for efficiency purposes, this would also have an effect on read errors, which also get worse with higher temperature.

I'll quote heavily from AnandTech's article on the topic:

"As always, there is a technical explanation to the data retention scaling. The conductivity of a semiconductor scales with temperature, which is bad news for NAND because when it's unpowered the electrons are not supposed to move as that would change the charge of the cell. In other words, as the temperature increases, the electrons escape the floating gate faster that ultimately changes the voltage state of the cell and renders data unreadable (i.e. the drive no longer retains data).

For active use the temperature has the opposite effect. Because higher temperature makes the silicon more conductive, the flow of current is higher during program/erase operation and causes less stress on the tunnel oxide, improving the endurance of the cell because endurance is practically limited by tunnel oxide's ability to hold the electrons inside the floating gate."

See: http://www.anandtech.com/show/9248/the-truth-about-ssd-data-...


Charge leaking away over time - arrays of capacitors - got it. Thanks.


"High-end SLC drives are no more reliable that MLC drives."

"None of the drives in the study came anywhere near their write limits"

I thought a major difference between MLC and SLC is the number of write cycles?


I think you're right, but what they are saying is that it is irrelevant because the drives/blocks are failing due to age, not the number of write cycles.


Something I was unaware of: single bit (correctable) read errors are not immediately repaired in NAND. Subsequent in-error reads are repeatedly processed by ECC on the fly, while the media rewrite is scheduled for sometime later.

This inflates the observed correctable error rate to some extent.


FWIW - single-bit errors are not synonymous with correctable errors in flash.

For any given type of flash chip the manufacturer will provide a spec like "you must be able to correct N bits per M bytes". The firmware for a flash drive must use forward error correction codes (e.g. BCH or LDPC) of sufficient strength to correct the specified number of bit errors.

Dealing with a certain amount of bit-errors is just part of dealing with flash.

For example, a chip could have a spec that you must be able to correct up to 8 bit errors per 512 bytes (made up numbers). If the chip had 4KiB pages, each page would be split into 8 "chunks" that were each protected by error correcting codes that were capable of correcting up to 8 single-bit errors in that chunk. As long as no "chunk" had more than 8 bit errors, the read would succeed.

So in this case you could theoretically have a page read with 64 bit errors that succeeded.

This is alluded to in the paper: "The first generation of drives report accurate counts for the number of bits read, but for each page, consisting of 16 data chunks, only report the number of corrupted bits in the data chunk that had the largest number of corrupted bits."
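The made-up numbers above can be sketched in a few lines of Python. This is only an illustration of the per-chunk rule, not any real controller's logic; the constants and the function name are my own assumptions.

```python
from typing import Sequence

PAGE_BYTES = 4096           # 4 KiB page, as in the example above
CHUNK_BYTES = 512           # bytes protected by one ECC codeword (made-up)
CORRECTABLE_PER_CHUNK = 8   # bit errors the code can fix per chunk (made-up)
CHUNKS_PER_PAGE = PAGE_BYTES // CHUNK_BYTES  # 8 chunks per page


def page_read_succeeds(bit_errors_per_chunk: Sequence[int]) -> bool:
    """Return True if every chunk is within the ECC's correction limit."""
    if len(bit_errors_per_chunk) != CHUNKS_PER_PAGE:
        raise ValueError(f"expected {CHUNKS_PER_PAGE} chunks per page")
    return all(e <= CORRECTABLE_PER_CHUNK for e in bit_errors_per_chunk)


# 64 bit errors spread evenly: every chunk sits exactly at the limit.
evenly_spread = [8] * CHUNKS_PER_PAGE
print(sum(evenly_spread), page_read_succeeds(evenly_spread))  # 64 True

# Only 9 bit errors, but all in one chunk: that chunk is uncorrectable.
clustered = [9] + [0] * (CHUNKS_PER_PAGE - 1)
print(sum(clustered), page_read_succeeds(clustered))  # 9 False
```

The point is that what matters is the worst chunk, not the page total, which is presumably why reporting only the largest per-chunk count is still useful.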


Thanks for your gentle correction. (I was thinking of Gray code, as taught to me during my Wonder Years. You're right as rain of course.)


The first key conclusion threw me a bit. Easy to misread as "Ignore UBER" rather than "Ignore UBER specs".


The article talks about ignoring UBER _specs_; the paper doesn't say that (at least that's not what I read).

From the paper:

    We find that UBER (uncorrectable bit error rate), the
    standard metric to measure uncorrectable errors, is not
    very meaningful. We see no correlation between UEs
    and number of reads, so normalizing uncorrectable errors
    by the number of bits read will artificially inflate the
    reported error rate for drives with low read count.

Basically, uncorrectable errors are not correlated with the number of reads, so computing an UBER by dividing the number of uncorrectable errors by the number of reads is meaningless. Therefore UBER is a meaningless metric.

They don't differentiate UBER from a manufacturer spec and UBER as measured in production.
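To make the inflation effect concrete, here is a small Python sketch. The drive numbers are invented for illustration and are not taken from the paper; the function name is likewise my own.

```python
def uber(uncorrectable_errors: int, bits_read: int) -> float:
    """Uncorrectable errors normalized by the number of bits read."""
    if bits_read <= 0:
        raise ValueError("bits_read must be positive")
    return uncorrectable_errors / bits_read


# Two hypothetical drives with the same number of uncorrectable errors
# but very different read volumes.
busy_drive = uber(uncorrectable_errors=2, bits_read=10**15)
idle_drive = uber(uncorrectable_errors=2, bits_read=10**12)

print(f"busy: {busy_drive:.1e}")  # busy: 2.0e-15
print(f"idle: {idle_drive:.1e}")  # idle: 2.0e-12
```

If errors don't scale with reads, the lightly read drive looks about 1000 times worse on paper despite having failed exactly as often.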


I was expecting to read about drives being bricked by buggy firmware. I guess they finally solved the firmware side.


Most of that stuff got ironed out ~2011. E.g. SandForce controllers: https://en.wikipedia.org/wiki/SandForce#Issues


I believe the high firmware failure rates only affected a relatively small number of models.

Hopefully firmware will get to a point where if the SSD does fail it will be far more graceful (revert to read-only), rather than the sudden death I've seen first hand.


According to the paper, these are full-custom drives with custom firmware. These are not commercial SSDs.


Where did you get that "full-custom" thing?

The FA says they tested: "10 different drive models" of "enterprise and consumer drives".


The original paper states the following:

The drives in our study are custom designed high performance solid state drives, which are based on commodity flash chips, but use a custom PCIe interface, firmware and driver.


The actual paper doesn't describe the drives as enterprise or consumer. Indeed the word "consumer" appears nowhere in the paper. The charitable interpretation of the FA is that its author was making the simplifying analogy of SLC : enterprise :: MLC : consumer.


Bianca Schroeder speaking on SSD reliability in the Stanford EE Computer Systems Colloquium: https://youtu.be/60OmhRJ0CUA. The abstract for her talk can be found at http://ee380.stanford.edu/Abstracts/160224.html.


> 30-80 percent of SSDs develop at least one bad block and 2-7 percent develop at least one bad chip in the first four years of deployment.

Did he perhaps mean 3-8 percent?


No, almost every NAND chip will have bad blocks. The initial ones are discarded during drive initialization in the factory. Then they ideally do some burn-in, which may catch a few more bad blocks. Then over the life of the drive you may see thousands of defective blocks, depending on capacity and number of NAND dies. However, the drive can be designed and tested to tolerate up to an expected defect limit with no data loss. Now if there is a bad chip, that can also be handled, but the recovery process can put high stress on the SSD, as in RAID recovery.


It is in fact 30-80%: see page 11 of the PDF: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f...


Aren't bad blocks pretty common?


More numbers with specific models would be nice. It would be perhaps useful to do a cost-benefit analysis over time of better SSDs vs. cheap SSDs with more redundancy and backups.


> drive days

What is a drive day?


Much like a man-hour is a measure of one human working for one hour, a drive day is one drive operating for one day. So a million drive days could be a million drives for one day, a thousand drives for a thousand days, or any combination.
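As a quick sketch in Python (the function name and the fleet figures are made up for illustration), drive days are just the sum of drives times days in service:

```python
from typing import Iterable, Tuple


def drive_days(fleet: Iterable[Tuple[int, int]]) -> int:
    """Sum drive days over (number_of_drives, days_in_service) groups."""
    return sum(drives * days for drives, days in fleet)


# Three different fleets that all add up to one million drive days.
print(drive_days([(1_000_000, 1)]))               # 1000000
print(drive_days([(1_000, 1_000)]))               # 1000000
print(drive_days([(500_000, 1), (500, 1_000)]))   # 1000000
```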


One day of continuous use on a drive.


I'd love to hear about folks' experience with the Samsung 850 Pro...


Power failures result in corruption. (To be fair, true of just about every SSD except the Intel DC line. But I have personally experienced this with the 850 Pro.)


Some higher end (e.g. enterprise) SSDs have some form of power backup (e.g. capacitors) to protect against power failure induced corruption. SSDs with power backup tend to be more expensive though than the consumer grade devices (such as the 850 Pro).


The Intel DC (= datacenter) line which I mentioned in my comment has this:

http://www.intel.com/content/www/us/en/solid-state-drives/da...

I'm sure there are others, but I don't know about them. Having surveyed the market myself recently, there's no consumer-grade SSD with such a feature (but the Intel ones aren't too pricey).


Ah UBER is a meaningless number (not a taxi company)



