And then there's this:
> Ignore Uncorrectable Bit Error Rate (UBER) specs. A meaningless number.
> Bad news: SSDs fail at a lower rate than disks, but UBER rate is higher
meaningless, it is.
The real world has wild temperature ranges, wilder temperature _changes_, mechanical variations well above and beyond datacenter use, and possibly wild loads (e.g. viruses, antiviruses, all sorts of updates, etc. etc.).
It is easier to do stats this way, though.
They totally are though, especially to the HN crowd where a lot of us may be putting hardware in data centers.
I agree that this isn't going to give us a clear picture of what to expect out of an SSD in, say, a netbook or something. On the other hand, data from a million SSDs reported by one company in a controlled environment is a hell of a control group if you want to go test factors like temperature, etc.
* Title makes it obvious, but 95°F: http://www.geek.com/chips/googles-most-efficient-data-center...
* An increase in DNS request failures (likely due to said heat) and bad routing caused internal iGoogle services to request this guy's stuff: https://www.youtube.com/watch?v=aT7mnSstKGs
For fun, my GPU will hover around 50-70°C, occasionally hitting 80°C. The CPU around 50°C, and the rest of the machine is a mystery!
I did find that <takes a drink> later in his talk he kept <takes a drink> taking a drink every 10 seconds or so <takes a drink>, which ended up being more than a <takes a drink> little <takes a drink> irritating to watch and listen <takes a drink> to.
From the paper's abstract (you did read the abstract, right?):
"... While there is a large body of work based on experiments with individual flash chips in a controlled lab environment under synthetic workloads, there is a dearth of information on their behavior in the field. This paper provides a large-scale field study covering many millions of drive days, ten different drive models, different flash technologies (MLC, eMLC, SLC) over 6 years of production use in Google’s data centers. We study a wide range of reliability characteristics and come to a number of unexpected conclusions. For example, raw bit error rates (RBER) grow at a much slower rate with wear-out than the exponential rate commonly assumed and, more importantly, they are not predictive of uncorrectable errors or other error modes. The widely used metric UBER (uncorrectable bit error rate) is not a meaningful metric, since we see no correlation between the number of reads and the number of uncorrectable errors. We see no evidence that higher-end SLC drives are more reliable than MLC drives within typical drive lifetimes. Comparing with traditional hard disk drives, flash drives have a significantly lower replacement rate in the field, however, they have a higher rate of uncorrectable errors." 
I guess it's easier to draw incorrect, snarky conclusions based on inaccurate summaries of papers than it is to take a moment to read a paper's abstract to double-check the work of a tech journalist. shrug :(
> > Bad news: SSDs fail at a lower rate than disks, but UBER rate is higher
> meaningless, it is.
The first sentence is referring to "specs". It's the manufacturer's claims that are useless.
The second point was referring to the actual measured error rates, which are obviously more meaningful.
Does anyone have any idea how many were in desktops/laptops vs. data centers?
Controlled, but likely not cool:
I'm not sure what you consider "the real world" (SSDs in ToughBooks for research expeditions in the Amazon?), but seeing that HN is a startup/pro IT social site, most of us are interested in running them in data centers...
Data centre data is still good data to help us better understand SSD lifetime and failures.
At the same time, they have Chrome OS. Most of those laptops have an SSD. All are connected to the cloud. How do they perform? When they have a problem, is it recorded by Google? I'm not 100% clear if I would want that, but it doesn't seem such a bad idea for a computer that is already 100% in the cloud.
> "A sudden power loss is a common cause for a system to fail to recognize an SSD. In most cases, your SSD can be returned to normal operating condition by completing a power cycle, a process that will take approximately one hour.
We recommend you perform this procedure on a desktop computer because it allows you to only connect the SATA power connection, which improves the odds of the power cycle being successful. However, a USB enclosure with an external power source will also work. Apple and Windows desktop users follow the same steps.
1. Once you have the drive connected and sitting idle, simply power on the computer and wait for 20 minutes. We recommend that you don't use the computer during this process.
2. Power the computer down and disconnect the drive from the power connector for 30 seconds.
3. Reconnect the drive, and repeat steps 1 and 2 one more time.
4. Reconnect the drive normally, and boot the computer to your operating system.
(Edit: wow, somehow I missed the line you had about that in your post. My apologies.)
The real problem with drives is their firmware has to garbage collect. So, yes, you can push a drive into a corner where it can't escape. Not to name names, but I've had this happen in my company's testing of different drives. That also means you get peculiar results, such as a drive needing to be restarted, or left to sit for an hour, after which it seems much better.
Our experience with MLC (managing and observing many customers' flash deployments) has been very positive, and when using major manufacturers' drives it works very well.
Perhaps the link should be changed to either https://www.usenix.org/conference/fast16/technical-sessions/... , or to the paper itself: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f... (linked from the first link).
I use SSDs in all my builds, servers and workstations. My most used are Samsung 850 EVOs and PROs, followed by the Intel 750 and Samsung 950 PRO.
Out of about 80 or so that I've put into production over the past few years I've had about three 850 EVOs go bad on me, just completely lock up the machine they are connected to, can't even run diag. I make sure to use the PRO series in critical environments, and EVO for budget.
From the paper: "The drives in our study are custom designed high performance
solid state drives, which are based on commodity
flash chips, but use a custom PCIe interface, firmware
These aren't drives that you can just go out and buy, so brands and model numbers would be meaningless to anyone outside of Google.
Link to the paper: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f...
That's worrying. It should at least remain a proper PCI-E citizen and not lock up the computer. And it should always provide the SMART data, even if the SSD is otherwise dead. And even if writing isn't safely possible anymore (due to too many bad bits), it should lock down to read-only.
Needless to say I didn't buy OCZ again, but I'm not sure if this is a general problem with SSDs or just Sandforce controllers.
I wouldn't trust them even with temporary data, no matter what performance they can push.
Interestingly, when this happened to me with some "ADATA" SSDs, they would still negotiate a link speed, so their PHYs did get initialized. But Linux didn't get any further information from the disks: device type, name, capacity... So maybe their firmware crashed halfway through initializing the SSD.
Most things I encounter in personal use are either CPU or GPU bound. There is nothing worse than having an application crawl to a halt when it is not even using a fraction of your system resources. Single threaded and 32 bit only applications are the bane of my existence.
Edit: yep, right in the paper: "The drives in our study are custom designed high performance solid state drives, which are based on commodity flash chips, but use a custom PCIe interface, firmware and driver."
The most surprising thing to me was the correlation between age and RBER.
I would not have guessed that normal aging would play a significant role in reliability. It would be fun to understand what is happening there.
So if I set up an SSD, then take it out of the machine and put it on the shelf, would the data loss be similar to one that I left in the machine?
If so, it's got to be cosmic rays or some kind of chemical change in the chips (anyone else remember tin whiskers growing on early RF transistors?)
The interesting note from this abstract is that, I guess, they saw the number of writes didn't seem to have as much of an effect. I don't know if they tested temperature as an independent variable.
I suspect higher temps played a role in influencing Google's results. Higher temps while powered on will increase write endurance. If they were ever powering off drives for efficiency purposes this would also have an effect on read errors, which also gets worse with higher temperature.
I'll quote heavily from AnandTech's article on the topic:
"As always, there is a technical explanation to the data retention scaling. The conductivity of a semiconductor scales with temperature, which is bad news for NAND because when it's unpowered the electrons are not supposed to move as that would change the charge of the cell. In other words, as the temperature increases, the electrons escape the floating gate faster that ultimately changes the voltage state of the cell and renders data unreadable (i.e. the drive no longer retains data).
For active use the temperature has the opposite effect. Because higher temperature makes the silicon more conductive, the flow of current is higher during program/erase operation and causes less stress on the tunnel oxide, improving the endurance of the cell because endurance is practically limited by tunnel oxide's ability to hold the electrons inside the floating gate."
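For a rough sense of how that retention effect is usually quantified (my own illustration, not from the paper or the AnandTech article): JEDEC-style retention estimates model the temperature dependence with an Arrhenius acceleration factor. A minimal Python sketch, with the activation energy and temperatures as assumed example values:

    import math

    K_B = 8.617e-5  # Boltzmann constant, eV/K

    def arrhenius_af(t_low_c, t_high_c, ea_ev=1.1):
        # Acceleration factor for charge loss at the higher temperature.
        # ea_ev is an assumed activation energy; real values depend on the NAND.
        t_low = t_low_c + 273.15     # Celsius -> Kelvin
        t_high = t_high_c + 273.15
        return math.exp((ea_ev / K_B) * (1.0 / t_low - 1.0 / t_high))

    # Unpowered retention at 55 C vs. a 40 C baseline; an AF > 1 means the
    # stored charge leaks away correspondingly faster. Illustrative only.
    print(round(arrhenius_af(40, 55), 1))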
"None of the drives in the study came anywhere near their write limits"
I thought a major difference between MLC and SLC is the number of write cycles?
This inflates the observed correctable error rate to some extent.
For any given type of flash chip the manufacturer will provide a spec like "you must be able to correct N bits per M bytes". The firmware for a flash drive must use forward error correction codes (e.g. BCH or LDPC) of sufficient strength to correct the specified number of bit errors.
Dealing with a certain amount of bit-errors is just part of dealing with flash.
For example, a chip could have a spec that you must be able to correct up to 8 bit errors per 512 bytes (made up numbers). If the chip had 4KiB pages, each page would be split into 8 "chunks" that were each protected by error correcting codes that were capable of correcting up to 8 single-bit errors in that chunk. As long as no "chunk" had more than 8 bit errors, the read would succeed.
So in this case you could theoretically have a page read with 64 bit errors that succeeded.
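A minimal sketch of that read-success logic, using the same made-up numbers as above (not how any particular firmware is implemented):

    CORRECTABLE_PER_CHUNK = 8  # assumed ECC strength: bit errors correctable per 512-byte chunk

    def page_read_succeeds(bit_errors_per_chunk):
        # A page read succeeds only if every chunk stays within ECC strength.
        return all(errors <= CORRECTABLE_PER_CHUNK for errors in bit_errors_per_chunk)

    # A 4 KiB page split into 8 chunks, each with exactly 8 bit errors:
    # 64 correctable bit errors in total, and the read still succeeds.
    print(page_read_succeeds([8] * 8))                    # True
    # A single chunk with 9 bit errors fails the whole page read,
    # even though the total error count is far lower.
    print(page_read_succeeds([9, 0, 0, 0, 0, 0, 0, 0]))   # False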
This is alluded to in the paper: "The first generation of drives report accurate counts for the number of bits read, but for each page, consisting of 16 data chunks, only report the number of corrupted bits in the data chunk that had the largest number of corrupted bits."
From the paper:
"We find that UBER (uncorrectable bit error rate), the standard metric to measure uncorrectable errors, is not very meaningful. We see no correlation between UEs and number of reads, so normalizing uncorrectable errors by the number of bits read will artificially inflate the reported error rate for drives with low read count."
They don't differentiate between UBER as a manufacturer spec and UBER as measured in production.
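To make the normalization point concrete, here is a toy calculation (made-up drives and numbers, just applying the UBER definition of uncorrectable errors divided by bits read):

    def uber(uncorrectable_errors, bits_read):
        # Uncorrectable bit error rate: UEs normalized by bits read.
        return uncorrectable_errors / bits_read

    # Two hypothetical drives, each hitting exactly one uncorrectable error.
    print(uber(1, 10**12))   # lightly read drive  -> 1e-12
    print(uber(1, 10**16))   # heavily read drive  -> 1e-16
    # If uncorrectable errors don't actually scale with reads (as the paper
    # reports), the lightly read drive's UBER looks 10,000x worse purely
    # because of the smaller denominator, not because its flash is worse.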
Hopefully firmware will get to a point where if the SSD does fail it will be far more graceful (revert to read-only), rather than the sudden death I've seen first hand.
The article says they tested "10 different drive models" of "enterprise and consumer drives".
"The drives in our study are custom designed high performance solid state drives, which are based on commodity flash chips, but use a custom PCIe interface, firmware and driver."
Did he perhaps mean 3-8 percent?
What is a drive day?
I'm sure there are others, but I don't know about them. Having surveyed the market myself recently, I didn't find any consumer-grade SSD with such a feature (but the Intel ones aren't too pricey).