

NAND Flash: How It Breaks - cushychicken
http://cushychicken.github.io/nand-pt5-how-nand-breaks/

======
userbinator
_NAND Flash is cheap - in fact, in terms of cost per bit, it is one of the
cheapest memory technologies on the market._

What I find most unfortunate is that almost everyone somehow gets so focused
on the capacity of their nonvolatile storage devices that they don't think
about longevity at all, and the industry feeds into this by advertising
bigger, cheaper memories while downplaying the disadvantages. Meanwhile the
"good stuff" is priced much higher than it should be. For example, multilevel
cell technology only multiplicatively increases capacity, but exponentially
decreases endurance and retention. This fact seems seldom-mentioned. 16Gbit
SLC and 32Gbit 4-level (2-bit) MLC made on the same process should be the same
area and price, but as of this post DRAMExchange says the SLC costs over 4x
more.
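
To put rough numbers on the capacity-versus-margin trade-off, here is a minimal sketch. It assumes, purely for illustration, that the charge levels are spread evenly across a fixed threshold-voltage window; `WINDOW_V` is a made-up value, not a datasheet figure. It only shows how the spacing between levels shrinks as bits per cell grow, and does not model endurance or retention directly.

```python
# Hypothetical illustration: bits per cell vs. number of charge levels.
# WINDOW_V is an assumed usable threshold-voltage window, not a real spec.
WINDOW_V = 4.0


def levels(bits_per_cell):
    """Number of distinct charge levels needed to store this many bits."""
    return 2 ** bits_per_cell


def level_spacing(bits_per_cell, window_v=WINDOW_V):
    """Gap between adjacent levels if they are spread evenly over the window."""
    return window_v / (levels(bits_per_cell) - 1)


if __name__ == "__main__":
    for bits in (1, 2, 3):
        print(f"{bits} bit(s)/cell: {levels(bits)} levels, "
              f"{level_spacing(bits):.3f} V between levels")
```

Under these assumptions, going from 1 to 2 bits per cell doubles capacity but cuts the gap between adjacent levels to a third, so the same amount of charge loss or disturb is far more likely to push a cell into the wrong level.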

I know ECC and advanced wear-leveling can help, but all this adds extra
complexity to the system and introduces more points of failure (see all the
SSD firmware problems for example.)

~~~
cushychicken
You could almost think of it as the MLC part being discounted for its lesser
reliability. There are a few good systems for counteracting that lesser
reliability, but fundamentally, you're right - it's just another subset of
failure points you're introducing.

There are some interesting options for offsetting the higher MLC failure
rates - Yu Cai and the VLSI Design Group at Carnegie Mellon have produced some
pretty interesting work hashing out the problems with MLC Flash and how to
mitigate them. For example: the in-place reprogram they've suggested would
really help with retention errors in MLC, which are the most frequent
kind of error in that medium.

~~~
t0mas88
Or the opposite, the SLC carrying a price premium because it's bought by
parties that need the extra reliability and thus have a higher budget.

------
jwise0
This is a great post, and really enhances my otherwise-rudimentary knowledge
of the failure physics of NAND flash. Program disturb, in particular, is
something that I didn't really have a good understanding of.

It looks like the author submitted this; if you're reading, I wrote an article
a while ago [1] with some of my reverse-engineering of a NAND flash part, and
I'd be interested to see which parts you think are on point and which parts
seem totally wrong. I'm looking forward to your next article about device
management!

[1]
[http://www.joshuawise.com/projects/ndfslave](http://www.joshuawise.com/projects/ndfslave)
-- HN discussion:
[https://news.ycombinator.com/item?id=8133450](https://news.ycombinator.com/item?id=8133450)

~~~
cushychicken
Hey Josh! I saw your original post, and I was fucking blown away by it - very
nice work. It's been a while since I read it, but I seem to recall that you
had the bulk of it right. I'm planning on writing another post after this one
talking about some basic device management methods, and then working my way up to
how they fit into embedded Flash filesystems. One of the things I'd like to do
as part of this is write up the BCH error correction algorithm that's commonly
implemented in NAND to check for bad bits - it's only slightly more
complicated than the row/column parity algorithm you mentioned in your post.
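
For readers unfamiliar with the row/column parity idea, a minimal sketch follows. It is a generic two-dimensional parity scheme written for illustration, not the exact layout from Josh's post or from any particular NAND part, and the function names are hypothetical. It assumes the stored parity bits themselves are intact and that at most one data bit has flipped.

```python
def parities(block):
    """Return (row_parities, column_parities) for a 2D list of bits."""
    rows = [sum(row) % 2 for row in block]
    cols = [sum(col) % 2 for col in zip(*block)]
    return rows, cols


def locate_single_flip(stored, block):
    """Compare stored parities with recomputed ones.

    Returns None if nothing changed, or (row, col) of a single flipped bit.
    Raises ValueError if the mismatch pattern is not a single-bit error.
    """
    stored_rows, stored_cols = stored
    rows, cols = parities(block)
    bad_rows = [i for i, (a, b) in enumerate(zip(stored_rows, rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(stored_cols, cols)) if a != b]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("not a single-bit error; cannot correct")


if __name__ == "__main__":
    data = [[1, 0, 1, 1],
            [0, 1, 0, 0],
            [1, 1, 1, 0]]
    stored = parities(data)            # computed when the page is written
    read_back = [row[:] for row in data]
    read_back[1][2] ^= 1               # simulate one bit flipping in storage
    row, col = locate_single_flip(stored, read_back)
    read_back[row][col] ^= 1           # flip it back to correct the error
    print((row, col), read_back == data)
```

A scheme like this can fix one bad bit per block and detect some larger errors; BCH codes generalize the idea to correct several bit errors per block, which is part of why they suit denser parts with higher raw error rates.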

Very glad you liked the post, and flattered that you got something out of it.
I saw on your site that you're a CMU graduate - did you happen to take any
classes or do any research with Dr. Yu Cai? Never met him myself, but he's
written some great papers on NAND Flash device physics that I've referred to
frequently in studying NAND.

------
rasz_pl
Remember self-healing NAND Flash with embedded heaters? What happened to that?
Too reliable for today's planned obsolescence?

[http://www.extremetech.com/computing/142096-self-healing-sel...](http://www.extremetech.com/computing/142096-self-healing-self-heating-flash-memory-survives-more-than-100-million-cycles)

~~~
cushychicken
Interesting! My understanding of Flash (and semiconductor devices in general)
suggests that higher temperatures mean higher electron energies - in the case
of NAND, that means higher leakage currents from the floating gates to other
bodies in the device.

I'll have to dig more on this - thanks for sharing!

