

NAND Flash: Dealing with a Flawed Medium - cushychicken
http://cushychicken.github.io/nand-pt6-dealing-with-flaws/

======
buserror
I personally hate nand. I do embedded stuff all day long, and you spend more
time dealing with NAND issues than anything else in the dev of a product.

It not only ships to you with bad blocks, it will fail as you write to it, and
it will also fail /by itself/ if left alone, so even a read-only filesystem is
not safe until you implement really paranoid duplication of pretty much
everything. You can't JUST rely on ECC; you need duplication to allow the
system to continue working. ECC will just tell you 'He's dead, Jim', and that
doesn't help if you're a production device.
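
The belt-and-suspenders approach described above can be sketched roughly like
this (illustrative Python, not firmware; the CRC stands in for whatever
ECC/detection step the hardware provides, and the offsets/redundancy factor are
made up):

```python
# Sketch only: store N copies of each critical page; on read, return the
# first copy whose checksum still verifies. Real firmware would use the
# controller's hardware ECC and real NAND pages -- zlib.crc32 here is just
# a stand-in for the error-detection step.
import zlib

COPIES = 3  # hypothetical redundancy factor

def write_redundant(flash, base_offset, stride, data):
    """Write `data` plus a checksum to COPIES separate locations."""
    record = data + zlib.crc32(data).to_bytes(4, "little")
    for i in range(COPIES):
        flash[base_offset + i * stride] = bytearray(record)

def read_redundant(flash, base_offset, stride):
    """Return the first copy that still passes its checksum."""
    for i in range(COPIES):
        record = flash.get(base_offset + i * stride)
        if record is None:
            continue
        data, stored = bytes(record[:-4]), bytes(record[-4:])
        if zlib.crc32(data).to_bytes(4, "little") == stored:
            return data  # this copy is intact
    raise IOError("all copies corrupt: 'He's dead, Jim'")

# Simulate: write 3 copies, corrupt one, and the read still succeeds.
flash = {}
write_redundant(flash, 0, 0x20000, b"boot-config")
flash[0][0] ^= 0xFF  # bit-rot in the first copy
assert read_redundant(flash, 0, 0x20000) == b"boot-config"
```

The point being: detection alone (ECC telling you a copy is dead) isn't
recovery; you need a second intact copy to fall back to.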

Also, from my experience, newer NAND seems to nuke erase blocks in /bunches/,
not just one at a time, so if you lose one, more often than not you lose 8+ in
a row. Also, the erase blocks themselves are getting bigger, so you lose a
hell of a lot more every time you lose a block.

Add the bloated filesystems in Linux (a JFFS2 scan at mount time can easily
take 30s!), and you end up wondering if it's really worth the trouble.

These days I always propose using a (quality) micro-SD card for any large
storage need, as it's a lot easier to replace, and a SPI NOR flash for the
system if at all possible (it's horrendously slow to erase/write, but at least
it's stable).

~~~
cushychicken
Having seen a lot of similar issues, I can say I feel your pain. It can be
really frustrating to get to the root of these issues, and when you do,
management is rarely willing to accept the answer of "We've done all we can,
the root of the problem is a device flaw we can't change."

How recent of a jffs2 image are you using? They've implemented some block
tables in newer versions that speed up mount time a lot, if my understanding
is correct. Also, I would highly recommend trying out UBIFS if you get the
chance - it's jffs2's successor, and well implemented.

------
mojoe
Until very recently I wrote firmware for SSD controllers. In addition to the
error correction mentioned in the article, we also used RAIN (Redundant Array
of Independent NAND) as yet another data protection measure. You can read more
about it here:
[https://www.micron.com/~/media/documents/products/technical-...](https://www.micron.com/~/media/documents/products/technical-marketing-brief/brief_ssd_rain.pdf?la=en)
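
For anyone unfamiliar with it, the core idea behind RAIN is RAID-style XOR
parity striped across NAND units, layered on top of per-page ECC. A toy sketch
of the parity math (my own illustration, not Micron's actual implementation;
it assumes one parity unit per stripe and a single failed unit):

```python
# Toy RAID-5-style parity over a stripe of NAND "pages". Real RAIN
# implementations rotate parity across dies/planes and kick in only when
# per-page ECC has already given up on a page.
def parity(pages):
    """XOR all pages in a stripe together, byte by byte."""
    out = bytearray(len(pages[0]))
    for page in pages:
        for i, b in enumerate(page):
            out[i] ^= b
    return bytes(out)

def rebuild(surviving_pages, stripe_parity):
    """XOR of the parity with all survivors recovers the lost page."""
    return parity(list(surviving_pages) + [stripe_parity])

stripe = [b"\x11" * 4, b"\x22" * 4, b"\x44" * 4]  # data on 3 units
p = parity(stripe)                                # stored on a 4th unit
lost = stripe.pop(1)                              # one unit goes bad
assert rebuild(stripe, p) == lost                 # data comes back
```

Because XOR is its own inverse, reconstruction is the same operation as parity
generation, which keeps the recovery path cheap in firmware.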

~~~
cushychicken
I'd love to learn more about SSD firmware. Do any companies publish their SSD
NAND management algorithms?

Unrelated: did you work at Micron? I was a Micron DRAM PE for a year.

~~~
mojoe
Companies generally do not publish their algorithms, unless they are filing
for a patent. I came across this great blog post a while back that will give
you a pretty solid foundation for NAND management:
[http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-int...](http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/). You can google more if you find
any of the sections interesting.

Yes, I did work for Micron! I'm doing data science now, though, so I'm not
currently in the storage industry.

~~~
cushychicken
Cool, I'll definitely read this when I get home! Would love to compare notes
against what I know regarding embedded NAND techniques.

The semiconductor industry really isn't a bad jumping off point for data
scientists. Most of the work I did in PE was statistical analysis of DRAM
module tests and trying to hunt down trends in test data to weed out failures
earlier. Having an environment where you can set up experiments with a million
data points is pretty sweet in its own way.

------
jordanbaucke
My uncle started and sold a few companies in the NAND Flash testing market for
manufacturers in the mid-90s and early 2000s. I'm having dinner with him this
evening - any questions I can query him with about the state of the art?

The statement: "Since most NAND manufacturers allow themselves to ship Flash
chips with a certain number of bad blocks, it is important to scan a chip for
bad blocks before the initial programming occurs." makes me wonder what degree
of variation different manufacturers allow - and how much better a piece of
hardware that costs significantly more based on "brand name" really is,
compared with the almost useless junk you can pick up for a few pennies a GB
in China.

~~~
cushychicken
Cool! Was he building embedded testers, or commercial testers for
semiconductor vendors? I would imagine that a lot has changed since then,
especially with the shrinking process size and corresponding increase in
retention failures.

In my experience, most of the reputable vendors like Micron, Samsung, and
Spansion are pretty up front with how many bad blocks they allow their devices
to ship with. The Micron device I've worked with most in designs allows 80 bad
blocks per unit at shipment - that's about 2% of total blocks. I've never seen
a brand new chip with that many bad blocks straight off the bat. (Two bad
blocks in a brand new chip is unusual.)
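
For reference, that initial scan works off the factory markers: on most parts
(Micron's included, if I remember the datasheets right), a factory bad block
is flagged with a non-0xFF byte in the spare area of its first page, and you
have to record those markers before the first erase wipes them. A toy sketch
of the scan (the geometry and the `read_spare` callback are invented for
illustration):

```python
# Hedged sketch of a factory bad-block scan. Assumes the common marking
# convention: first spare byte of a block's first page is left != 0xFF
# for a factory bad block. Must run before the first erase, since
# erasing a block resets its spare area to all 0xFF and loses the mark.
GOOD_MARKER = 0xFF

def scan_bad_blocks(read_spare, num_blocks):
    """Return the set of factory-marked bad block numbers.

    read_spare(block) -> first spare byte of the block's first page.
    """
    return {b for b in range(num_blocks) if read_spare(b) != GOOD_MARKER}

# Simulated 4096-block device with two factory-marked bad blocks.
marks = {137: 0x00, 2048: 0x00}
bad = scan_bad_blocks(lambda b: marks.get(b, GOOD_MARKER), 4096)
assert bad == {137, 2048}
assert len(bad) / 4096 < 0.02  # well under a ~2% shipment allowance
```

In a real system you'd persist that set as a bad block table somewhere
redundant, since the markers themselves are one erase away from gone.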

Most of the big vendors moving to support the ONFI standard has actually had a
pretty interesting effect - since NAND is in such demand and has a common
interface, it's become pretty commoditized. If you want to sell chips, you
have to abide by ONFI. If you do that, and your prices are competitive, your
products are just about guaranteed to sell. As a rule of thumb, however, I
would agree with your sentiment: vendors who won't share raw bit error rates
or uncorrectable bit error rates are generally not worth the time of day.
That's part of the reason I'm so fond of Micron and have cited them a bunch in
these articles - they are very forthcoming with data about how their parts
work, and how to put them into systems in such a way that they'll work over
your device lifetime.

~~~
jordanbaucke
Completely agree - the industry is very commoditized, and like all SEMIs very
cyclical. Build new capacity, buy test, dry spell, repeat. They had to fill
out the bottom line with CMOS testing, etc.

He was building commercial testers, Sytest and then Nextest, and given the
cyclical nature, the consolidation in the industry circa 2008 when Teradyne
swallowed Eagle & Nextest makes sense. I'll read up on ONFI - any specifics
you're curious about I can ask him directly...

~~~
cushychicken
Does he have any sort of insight on the early washout rates of the chips they
ship? I dunno if you saw the recent Carnegie Mellon/Facebook paper about NAND
retention, but a major finding of that is that there's definitely a "second
bathtub" in NAND devices that ship, but don't live for very long in the field.
Other than that, though, most of my interest in NAND starts long after they
pass out of your uncle's machines and their descendants. :)

Unrelated - I work with a bunch of former Teradyne employees.

