One of the things I had to model was the failure mode of the device. It was simple, I assumed:
- Stuck bits (at 1 or 0) would cause either a data write failure, or a failed attempt to write to a tracking structure. Both recoverable by using other storage.
- Failure to erase would result in a bad block.
- Writes to other memory wouldn't disturb already committed writes.
- ... a few other things as well.
NOR makes it easy, I guess. For the time, and for the quality of flash, it worked pretty well. The Intel flash guys told us we were being too paranoid, even as it was.
NAND flash looks a lot more byzantine, and I can well imagine that the survival of your data depends in large part on how well the firmware team understands error recovery and transactions. Having been there, I'm really cautious about which SSD brands I use (no names here, but you can probably guess who I avoid).
As for the original report: if you rely on your storage devices to survive a power outage, you should be testing them rigorously for exactly that. It's a feature that is easy to get wrong, and when it fails on you it is very nearly the end of the world, because it's a correlated failure of multiple devices at the same time, with no time for any rebuild operation.
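A minimal sketch of what such a test can look like, in the spirit of tools like diskchecker.pl (all names here are illustrative, not any particular tool's API): append fixed-size records, each carrying a sequence number and a CRC, fsync after every write, and record the last sequence number acknowledged. After yanking power mid-run, a verifier pass checks that every acknowledged record survived intact.

```python
import os
import struct
import zlib

RECORD_FMT = "<IQ"                    # crc32, then sequence number
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def write_records(path, count):
    """Append `count` records, fsyncing each; return the last acked seq."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    last_acked = -1
    try:
        for seq in range(count):
            payload = struct.pack("<Q", seq)
            rec = struct.pack("<I", zlib.crc32(payload)) + payload
            os.write(fd, rec)
            os.fsync(fd)              # only now may we call it durable
            last_acked = seq
    finally:
        os.close(fd)
    return last_acked

def verify_records(path):
    """Return the highest contiguous intact sequence number, or -1."""
    highest = -1
    with open(path, "rb") as f:
        while True:
            rec = f.read(RECORD_SIZE)
            if len(rec) < RECORD_SIZE:
                break                 # torn tail record: stop here
            crc, seq = struct.unpack(RECORD_FMT, rec)
            if crc != zlib.crc32(struct.pack("<Q", seq)) or seq != highest + 1:
                break                 # corruption or a gap
            highest = seq
    return highest
```

If, after a real power cut, `verify_records` returns less than the last acknowledged sequence number, the device (or something in the stack below it) dropped writes it had already acked.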
I've seen various devices, even of high-end caliber, that had flaws in their power-hold circuitry. If you rely on it, test it. The same goes for HDDs. If you rely on your UPS to ride out the outage, be sure to test that too. And retest all of these as they age: aging affects the ability of both batteries and capacitors to hold power.
It can if there's enough redundancy. If your pool is a stripe of mirrors, then you can mitigate this effect by putting each half of a mirror on separate power supplies.
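As a sketch, a stripe of mirrors with each half of every mirror on a separate power supply could be built like this (the device names and the assignment of drives to power domains are illustrative):

```shell
# Stripe of two mirrors; sdb/sdd fed from PSU A, sdc/sde from PSU B.
zpool create tank \
    mirror /dev/sdb /dev/sdc \
    mirror /dev/sdd /dev/sde
# Each top-level mirror keeps one copy per power domain, so losing one
# supply takes down at most one side of each mirror.
zpool status tank
```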
Mirroring is also part of ZFS, but it's not ZFS magic. My preference is the Dynamo / Hadoop / layered approach, in which the application is aware of multiple underlying storage devices and is written to be resilient to the failure of any of them.
No, the data will be repaired transparently and you won't notice a thing. Errors will be logged, of course, but you'll know that the data is intact.
> My preference is the Dynamo / Hadoop / layered approach, in which the application is aware of multiple underlying storage devices and is written to be resilient to the failure of any of them.
And ZFS should be the underlying file system upon which these applications are built. If you don't have protection against silent data corruption at the lowest level then you aren't going to have any idea when it happens.
Power faults often originate on the AC side, however, and only later propagate to the DC side. This is impractical to test exhaustively, since there are too many configurations and variables and it's much harder to test accurately, but it would be interesting to see whether AC-side faults to a system power supply have a different effect on the data than DC-side faults.
> Fear, uncertainty and doubt (FUD) is a tactic used in sales, marketing, public relations, politics and propaganda.
> FUD is generally a strategic attempt to influence perception by disseminating negative and dubious or false information. An individual firm, for example, might use FUD to invite unfavorable opinions and speculation about a competitor's product; to increase the general estimation of switching costs among current customers; or to maintain leverage over a current business partner who could potentially become a rival.
> The term originated to describe disinformation tactics in the computer hardware industry but has since been used more broadly. FUD is a manifestation of the appeal to fear.
> About Robin Harris
> Robin Harris has been a computer buff for over 35 years and selling and marketing data storage for over 30 years in companies large and small.
> Robin Harris is president of TechnoQWAN, a consulting and analyst firm in Sedona, Arizona. He also writes StorageMojo.com, a blog which accepts advertising from companies in the storage industry, and has a 30 year history with IT vendors. He has many industry contacts, many of whom are friends and all of whom he has opinions about.
The poor performance of mechanical disks is the number one performance bottleneck for most IT systems. I believe that FUD like this is the main reason SSD costs are still relatively high.
From your quoted definition, "FUD is generally a strategic attempt to influence perception by disseminating negative and dubious or false information." (Emphasis added.)
It is neither dubious nor false that SSDs without some form of capacitance (enough to flush outstanding writes cached in the on-board buffer to the NAND media in the event of a power loss) will eventually suffer data loss or corruption. To think otherwise betrays a fundamental misunderstanding of how SSDs actually work.
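For context, here is a minimal sketch of the application side of the durability contract, under the assumption that the drive honors cache-flush commands. Even when this sequence is followed exactly, a drive that acks flushes it hasn't actually completed (or lacks the capacitance to complete them on power loss) can still lose the data.

```python
import os

def durable_write(path, data):
    """Replace `path` with `data` so it survives a crash, assuming an
    honest drive: flush the file contents, then the directory entry."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                  # flush file data and metadata
    finally:
        os.close(fd)
    os.rename(tmp, path)              # atomic replace
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)                 # persist the rename itself
    finally:
        os.close(dfd)
```

The point is that the OS-level fsync only pushes data as far as the drive's write cache; what happens between that cache and the NAND on power loss is entirely up to the drive's capacitors and firmware.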
I've known about this for years, and have taken extreme care only to use SSDs in laptops (with their built-in battery), hosts with UPS backup power, configured to gracefully shut down if mains power is lost, and/or drives that have onboard capacitors or super-caps.
This has, in practice, meant using only Intel 320 and DC3x00 series drives, though the OWC Enterprise class drives also have super-caps (if you're willing to pay that high a premium for your storage). The Crucial m500 drives appear also to have a bank of small capacitors, but I currently have no direct experience with them.
> In our experiments, we observed five out of the six expected failure types, including bit corruption, shorn writes, unserializable writes, metadata corruption, and dead device. Surprisingly, we found that 13 out of 15 devices exhibit failure behavior contrary to our expectation. This result suggests that our proposed testing framework is effective in exposing and capturing the various durability issues of SSDs under power faults.
Spinning hard drives fail too. All drives fail eventually.
Why hasn't this been a problem with the cache on spinning disks? Are they typically only buffering reads?
It's well known in communities that care about storage reliability (such as the database world, which is where my experience comes from) that consumer-grade drives, in particular, are notorious for lying about writes having been successfully flushed to the underlying media.