A Large-Scale Study of Flash Memory Failures in the Field (facebook.com)
63 points by danso on Aug 11, 2015 | 7 comments

In my own observation, based mostly on the lower-end Samsung SSDs (840/840 EVO) in desktops and laptops at home and work, the SSDs I've been using appear to have problems reading back 'very stale' (a year or two old) data; one even failed outright.

I've recently been backing them up, issuing the 'ATA security erase' command (I'm doubtful this is really a secure erase rather than a 'factory re-trim'), and then restoring the backup. This appears to improve read performance, and I hope it also allows the drive to continue operating for its intended lifetime.

I hypothesized this might help after observing that magnetic hard disks also exhibited a recovery of read performance after a 'non-destructive write test' using the badblocks program.

You aren't alone.

The 840 and 840 EVO were Samsung's first drives using 19nm planar TLC flash. With TLC flash they were storing 3 bits per flash cell, which at that small process size gave all sorts of trouble with old, stale data. It was a pretty well documented problem: http://www.anandtech.com/show/8997/samsung-releases-statemen...

The newer Samsung TLC drives use V-NAND, which supposedly has lower cell-to-cell interference due to the larger process size and 3D structure: http://www.samsung.com/global/business/semiconductor/html/pr...

Samsung has some firmware updates and software utilities that should be able to restore the performance of your SSDs.

TRIM and "secure erase" are related, but not quite the same thing. TRIM is a command the OS sends to say "hey, this data was deleted, so you can safely erase this block at any time", while the ATA Secure Erase command means "erase everything right now". With TRIM the SSD firmware decides when to do the erase; with the other it just goes and does it.
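That distinction can be sketched as a toy flash translation layer (a hypothetical model, nothing like real firmware): TRIM marks a block as reclaimable and leaves the erase to background garbage collection, while secure erase wipes everything immediately.

```python
# Toy FTL sketch of TRIM (deferred erase) vs. secure erase (immediate).
# Purely illustrative; real SSD firmware is vastly more complex.

class ToyFTL:
    def __init__(self):
        self.valid = set()    # blocks holding live data
        self.trimmed = set()  # blocks the OS said are reclaimable

    def write(self, block):
        self.valid.add(block)
        self.trimmed.discard(block)

    def trim(self, block):
        # "You may erase this whenever convenient."
        self.valid.discard(block)
        self.trimmed.add(block)

    def garbage_collect(self):
        # Firmware erases trimmed blocks in the background, at its leisure.
        erased, self.trimmed = self.trimmed, set()
        return erased

    def secure_erase(self):
        # "Erase everything, right now."
        self.valid.clear()
        self.trimmed.clear()

ftl = ToyFTL()
ftl.write(3); ftl.write(5)
ftl.trim(3)
assert 3 in ftl.trimmed and 3 not in ftl.valid
assert ftl.garbage_collect() == {3}
ftl.secure_erase()
assert not ftl.valid and not ftl.trimmed
```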

As for the performance: because of how flash technology works, the drive can't rewrite data in place. It first has to "reset" the block/cell back to all 1 bits, and only then can it set specific bits to 0. TRIM (and the garbage-collection algorithms) work around this by letting the SSD reset blocks in advance, so they're always prepared for a new write. And then there are at least two more mechanisms working alongside to ensure the lifetime of the drive.
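The erase-before-write constraint can be shown with a minimal bit-level model (an assumption-laden sketch, not a real device): erasing sets every bit in a block to 1, and programming can only clear bits to 0, so rewriting a page means erasing the whole block first.

```python
# Toy model of NAND program/erase semantics. Tiny sizes for illustration.

PAGE_MASK = 0xFF  # 8-bit "pages" for the example

class FlashBlock:
    def __init__(self, pages=4):
        # An erase sets every bit in the block to 1.
        self.pages = [PAGE_MASK] * pages
        self.erased = [True] * pages

    def erase(self):
        # Erasure only works at whole-block granularity.
        self.pages = [PAGE_MASK] * len(self.pages)
        self.erased = [True] * len(self.pages)

    def program(self, page, value):
        # Programming can only clear bits (1 -> 0), never set them back.
        if not self.erased[page]:
            raise RuntimeError("page must be erased before reprogramming")
        self.pages[page] &= value
        self.erased[page] = False

block = FlashBlock()
block.program(0, 0b10110010)
assert block.pages[0] == 0b10110010
# Rewriting page 0 in place fails; the whole block must be erased first.
try:
    block.program(0, 0b00001111)
except RuntimeError:
    block.erase()
    block.program(0, 0b00001111)
assert block.pages[0] == 0b00001111
```

This is why pre-erased blocks (courtesy of TRIM and garbage collection) make writes fast: the slow block-erase step has already happened.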

For a good insight into what's going on with your drives, you can read the Wikipedia article, as a start: http://en.wikipedia.org/wiki/Flash_memory

Edit: you meant "read performance" and I read just "performance". Mmmm... on old data, what might be going on is that charge leaked from some of the bits that compose your files, and the ECC algorithms have to go to greater lengths to correctly reconstruct your information -- so reads end up being slower. That's my best guess.
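The intuition can be put in a crude formula (made-up numbers, purely illustrative): if each ECC pass corrects only so many flipped bits, stale data with more leaked charge needs more correction passes, and each extra pass adds retry latency.

```python
# Crude latency model: base read cost plus one retry per extra ECC pass.
# All constants are invented for illustration only.

def read_latency_us(bit_errors, base_us=100, correctable_per_pass=4,
                    retry_us=400):
    """Return a toy read latency in microseconds for a page with
    `bit_errors` flipped bits."""
    extra_passes = max(0, (bit_errors - 1) // correctable_per_pass)
    return base_us + extra_passes * retry_us

fresh = read_latency_us(bit_errors=0)   # freshly written data: 100 us
stale = read_latency_us(bit_errors=12)  # stale, leaky data: 900 us
assert stale > fresh
```

Rewriting the data (or a secure erase plus restore) refreshes the cell charge, dropping the error count back near zero -- which would explain the observed recovery in read speed.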

Indeed, that was my guess as well, given the performance improvement from rewriting the data. The Facebook study also points to another potential source of the performance variation.

According to the paper, on /some/ tested drives the presence of trimmed data (e.g. from OS updates) could lead to more complicated remapping patterns, and thus a longer tree search when finding the correct block to read. However, that doesn't correlate well with the observed speed at which the backups read data over time. The read speed was highly variable, with an extremely slow initial portion that would likely correspond to stable Windows install files.

dr_zoidberg's reply makes a lot of sense - it sounds like you experienced a lot of basic retention errors, which is reasonably common on modern small-process NAND. By doing the secure erase operation you're executing a block erase at the device level: it sets all those pages that leaked charge - and by extension, your data - back to 1s, so they're ready to be written again.

I'm a little surprised that the stale time was on the order of a year. Have you experienced any other failures?

As a guy working in digital forensics and data recovery, I find this very interesting. When it comes to flash-based storage there's very little real information and studies, with a lot of myths that just add noise and almost no value.

It is killing me to discover this at work - I only have time to read the abstract, can't wait to read this at home.

Not super shocked to see that read disturbs are a rare cause of field failures, just because I wouldn't imagine any of the data Facebook uses would sit still on one drive long enough. I would expect them to fall victim to failure modes like retention errors instead, since they're probably running these drives really hot and using some of the most cutting-edge technology possible - MLC, or maybe even TLC, where the margins between individual energy states aren't so wide.
