The first part is pretty easy to understand: storage manufacturers have competed for years in a commodity market where consumers often choose price per gigabyte over URER (the unrecoverable read error rate), and at the scale of the market, small savings of cents add up to better margins. And while the 'enterprise' Fibre Channel and SCSI drives could (and did) command a hefty premium, the shift to SATA drives really put a crimp in the overall margin picture. So the disk manufacturers are stuck between the reliability of the drive and the cost of the drive. They surf the curve where it is just reliable enough to keep people buying drives.
This trend bites back, making the likelihood of an error while reading more and more probable. Not picking on Seagate here (they are all similar), but let's look at their Barracuda drive's spec sheet. You will notice a parameter 'Nonrecoverable Read Errors per Bits Read', and you'll see that it's 1 per 10^14 across the sheet, from 3TB down to 250GB. It is a statistical thing; the whole magnetic-field-to-digital-bit pipeline is one giant analog loop of error extraction. 10^14 bits. So what does that mean? Let's say each byte is encoded in 10 bits on the disk. Three trillion bytes is thirty trillion bits in that case, or 3x10^13 bits. Best case, if you read that disk from one end to the other (like you were re-silvering a mirror from a RAID 10 setup), you have roughly a 1 in 3 chance that one of those sectors won't be readable on a perfectly working disk. Amazing, isn't it? Trying to reconstruct a RAID 5 group with 4 disks remaining out of 5? Look out.
So physics is not on your side, but we've got RAID 6, Chuck! Of course you do, and that is a good thing. But what about when you write to the disk and the disk replies "Yeah sure boss! Got that on the platters now", but it didn't actually write it? Now you've got a parity failure sitting on the drive, waiting to bite you. Or worse, it does write the sector but writes it somewhere else (I saw this a number of times at NetApp as well). There are three million lines of code in the disk firmware. One of the manufacturers (my memory says Maxtor) showed a demo where they booted Linux on the controller of the drive.
Bottom line: the system mostly works, which is a huge win, and a lot of people blame the OS when the drive is at fault, another bonus for the manufacturers. But unless your data is in a RAID 6 group or on at least 3 drives, it's not really 'safe' in the "I am absolutely positive I could read this back" sense of the word.
The worst part of this is that often when disks fail, they just become extremely slow (reads taking hundreds of milliseconds) rather than failing explicitly. That can be significantly worse than an explicit failure, which would let the OS read from another disk.
Nit: Remember the old fallacy: if there's a 50% chance of rain on each of Saturday and Sunday, that doesn't mean there's 100% chance of rain this weekend. The quoted number of nonrecoverable errors per bits read is presumably an expected value on a uniform distribution (which is ludicrous, since errors are not even remotely independent, but I don't know what else they could be expecting the reader to assume). In that case, the expected probability of an error on each bit read is 1 / 1E14. The probability of zero errors after reading 3E13 bits is ((1E14-1)/1E14)^3E13, which is about 75%.
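The arithmetic in this nit can be checked in a few lines (a sketch only; the independence assumption is the comment's model, not a claim about real drives):

```python
import math

# Spec-sheet URE rate: one unrecoverable error per 1e14 bits read.
p_error_per_bit = 1e-14
bits_read = 3e13  # ~3 TB of data at the ~10 bits/byte encoding assumed above

# Naive expected-value estimate (the "1 in 3" figure upthread):
expected_errors = p_error_per_bit * bits_read  # ~0.3

# Treating each bit as an independent trial instead:
# P(zero errors) = (1 - p)^n; log1p keeps the tiny p numerically accurate.
p_zero_errors = math.exp(bits_read * math.log1p(-p_error_per_bit))  # ~0.741

p_at_least_one = 1 - p_zero_errors  # ~0.26, i.e. roughly a 1-in-4 chance
```

So "about 75% chance of a clean read" and "expected 0.3 errors" are two views of the same spec number.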
This is the main difference between consumer-grade and enterprise-grade drive firmware. Take the HGST Deskstar and Ultrastar, or the Seagate Constellation ES and Barracuda 2 TB: they are physically exactly the same, but behave very differently when encountering an error. Basically, the enterprise drive will fail fast and report the error so the array can recover the data (because it's probably RAID-backed), while the desktop drive will try to read the data at all costs (because it's probably NOT RAID-backed), and therefore suffers from extremely long time-outs.
> The probability of zero errors after reading 3E13 bits is ((1E14-1)/1E14)^3E13, which is about 75%.
Still not the sort of odds you want to bet your precious data against...
That is, in fact, why I pay the huge premium for 'enterprise' SATA.
And yes, usually 'enterprise' drives fail outright rather than just getting really slow.
But not always. 'Enterprise' drives sometimes get shitty rather than just failing outright, too. I mean, it's usually like 1/3rd to 1/5th of expected performance rather than 1/100th, but this is disk; it's already on the edge of unacceptably slow. Just cutting performance in half means I'm getting complaints.
Overall? Reducing this problem from twice a month to twice a year is worth the premium, but all spinning rust is shit. The 'enterprise' stuff is only slightly less shit.
It bothers me that RAID controllers don't handle this more intelligently.
Very interesting post btw. Thanks.
So in the storage business, everything used to be done on the highest-performance, most reliable drives. Typically those were 15K RPM Fibre Channel drives, which cost 8x the price of an equivalent SATA drive. In 2003 NetApp created the 'NearStore' line, which was filers that used SATA drives. The idea was that tape was way too slow; you could put stuff on SATA drives and, for low duty cycles, get to it a lot faster. SATA drives were 'nearline' storage, which was defined to be "about 99.8% available." This was hugely better than tape but not as good as a "real" disk array. Of course, what people discovered was that you could have some pretty huge data sets be reasonably accessible that way. The first folks were the oil and gas companies, but lots of people jumped on the bandwagon; it was cheaper than a storage array and faster than tape.
Now let's look at flash. Most people think of flash as a disk replacement because it connects to the computer through a disk interface (even though it doesn't have disk mechanicals). So it seems like a really fast disk. My assertion, though, is that it isn't a fast disk; it's slow memory. It is 'nearline' memory. Which is to say, it takes a lot longer to access something in flash than it does in memory, but it's craploads faster than getting it off of spinning rust.
Four years ago, I told a guy who claimed to be Intel's key flash architect that if Intel put flash on the same side of the MMU as the processor, we would completely change the way servers are built. Why? Can you imagine having 200GB of address space that is ready to go the moment you turn on the processor? Once you start loading page tables, you read in the ones in non-volatile flash and blam, all your data structures are ready to go, right now. You want to make some sort of logical computation based on the state of a 50GB data structure, and all you do is follow the pointers? The 'driver' for flash attached as memory is
var = *((var type *)(0xsome64bitaddress));
Bottom line: solid state drives are doomed. You can't make them much denser (the physics doesn't work), and you don't want to waste your time going through the disk driver and I/O subsystem to get to what is essentially more memory. But as flash moves closer to the CPU, its impact will become pretty impressive.
 This led people like IBM to propose 'bricks', which were 8 SATA drives in a RAID or mirrored config pretending to be a single reliable drive.
 One of my favorite Dave Hitz quotes: "Oil and Gas companies have an awesome compression algorithm, they can take a 600TB data set and compress it to one bit, 'oil' or 'no oil.'"
 Yes the new Sandy Bridge DDIO architecture helps with this. (http://www.realworldtech.com/sandy-bridge-ep/2/)
Not trying to be a party pooper, but flash does not behave like random access memory at all.
With RAM, you can erase/overwrite memory at any address at will.
With both NAND and NOR flash, programming can only flip bits from 1 to 0; 'resetting' bits back to 1 is only done in chunks, by erasing a whole block at a time. Flash is actually called 'flash' because, in the early prototypes, there was a visible flash of light when a block was reset to all 1s.
This means that to erase part of a block, any data that is to be preserved within that block first has to be moved away. So the controller uses a block mapper to keep the address presented on the interface the same while actually relocating data someplace else (a bit like the remapping that common hard drives do when a sector goes bad).
To further complicate matters, there are only so many times a block can be erased: erasing causes wear. So the block mapper also incorporates a mechanism for wear leveling. This is also why you want to align your partitions to the erase blocks. Otherwise you'll get write amplification: you'd write one block of your filesystem, but if that block straddles two erase blocks, it causes two relocations.
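The 1-to-0 programming rule and the erase-before-rewrite cycle can be made concrete with a toy model (my own sketch, not any real controller's firmware; real erase blocks are 128 KB and up, not 4 bytes):

```python
BLOCK_SIZE = 4  # toy erase-block size in bytes, for illustration only

class FlashBlock:
    """Toy NAND-style block: programming can only clear bits (1 -> 0);
    setting bits back to 1 requires erasing the whole block."""
    def __init__(self):
        self.data = bytearray(b"\xff" * BLOCK_SIZE)  # erased = all 1s
        self.erase_count = 0  # wear: a block survives a limited number of erases

    def program(self, offset, payload):
        for i, byte in enumerate(payload):
            # AND models the hardware: a 0 bit sticks until the next erase.
            self.data[offset + i] &= byte

    def erase(self):
        self.data = bytearray(b"\xff" * BLOCK_SIZE)
        self.erase_count += 1

blk = FlashBlock()
blk.program(0, b"\xf0")     # clears the low four bits of byte 0
blk.program(0, b"\x0f")     # can only clear MORE bits: byte 0 is now 0x00
assert blk.data[0] == 0x00  # not 0x0f -- you can't write 1s without erasing
blk.erase()
assert blk.data[0] == 0xff and blk.erase_count == 1
```

That second `program` call silently destroying data is exactly why the controller must relocate-then-erase instead of overwriting in place.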
There are filesystems in the Linux kernel built to do wear leveling (I think YAFFS and UBIFS); they operate on MTD devices (/dev/mtd?) that do not do their own wear leveling. If you use CF or SD cards, or USB flash drives, they'll have a controller which does the wear leveling (but they don't always do this in a sane way).
So, sure, you could get flash to behave like RAM, but it would be through emulation, by hiding the relocation and wear-leveling logic. You could propose doing this in some super-heavy special MMU, and I am with you insofar as it could be of some benefit to eliminate abstraction layers like ATA and avoid hitting a bus (PCI, or USB on PCI).
It actually reminds me a lot of old core memory (http://en.wikipedia.org/wiki/Magnetic-core_memory). In order to read core memory, you had to actually try to write to it. If the voltage changed, you guessed wrong, so you then knew the correct value.
It's a nice idea, but the driver for flash needs to do much more than this. For example, blocks need to be erased before they can be re-written, and since they have a limited number of erase cycles, blocks must also be copied around and remapped.
The latency of NAND flash will be so much higher than DRAM's that you'd have to use DRAM as a cache layer in front of the flash, in which case you could just use one of those block-caching schemes coupled with a PCIe SSD and get everything that is good with very little trouble at the architecture level.
But it would be useful to put the NAND controller closer to the CPU, remove the block-device emulation layer, and let the OS do wear leveling and block allocation. We could get high-performance swap, or put specialized MTD filesystems like UBIFS there.
>You want to make some sort of logical computation based on the state of a 50GB data structure, and all you do is follow the pointers? The 'driver' for flash attached as memory is
> var = *((var type *)(0xsome64bitaddress));
You don't need any changes in hardware for this; you can use memory-mapped files. The main reason software today serialises everything is interoperability. For example, old versions of Microsoft Office applications did this; modern ones don't.
Also read about PhantomOS.
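As a rough illustration of the memory-mapped-file approach (the file name and two-field layout are made up for the example):

```python
import mmap
import os
import struct
import tempfile

# Persist a structure to a file, then access it in place via mmap
# instead of deserialising it byte-by-byte.
path = os.path.join(tempfile.mkdtemp(), "state.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<QQ", 42, 0xDEADBEEF))  # two 64-bit fields

with open(path, "r+b") as f:
    mem = mmap.mmap(f.fileno(), 0)            # map the whole file
    a, b = struct.unpack_from("<QQ", mem, 0)  # read "through a pointer"
    assert (a, b) == (42, 0xDEADBEEF)
    struct.pack_into("<Q", mem, 0, 43)        # update in place
    mem.flush()                               # OS writes the page back
    mem.close()
```

The point of the thread, though, is that this still round-trips through the kernel's page cache and block layer, which is what flash-on-the-memory-bus would cut out.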
That still triggers the OS drivers on the other end which has to go through the same complicated drivers to read blocks into RAM before mapping them into the process, so it's nowhere comparable to what he's proposing.
Flash on the main bus would mean the only thing a read would go through would be the MMU page translation, then straight to either the flash chips or the flash controller, without being bottlenecked by passing through other parts of the system.
And that bottleneck is already real - I have SSD setups at work that are pretty much maxing out the PCIe bus on those servers, and they're not even that expensive (as long as you want performance, not large amounts of storage).
Now, there are still challenges, not least that either you're still talking to a flash controller rather than directly to the flash chips, or you have to handle wear levelling etc. yourself if you want to do writes. Both have disadvantages.
Some embedded systems do talk straight to the flash chips themselves and handle the wear levelling directly - I once put Linux on a BIOS-less x86 system with flash directly on the main bus where the "bootloader" did the equivalent of a memcpy to copy the kernel into a suitable location, set up some very basic stuff and jumped to it, and where the "disk" used an FTL (flash translation layer) driver that handled wear levelling in the kernel code.
But this would require completely changing how modern operating systems work. No doubt applications developers would love it though.
Disks and filesystems as a weird, separate appendage hanging off the side of a computer are conceptually a weak idea.
Once you've unified your address space - stable storage is in the same space as unstable storage - then your entire world changes. The trick then becomes: how do you represent objects? (Hint: others have already trodden this ground; see AS/400.)
A flash-aware storage layer has to ensure that:
a) data is buffered during writes, and a power failure won't leave the storage in an inconsistent state;
b) blocks to be overwritten are reallocated and erased in the background (you cannot simply overwrite a flash block; it must be erased first);
c) the data structures used (trees etc.) are optimized for the size of the flash block;
d) block corruption is handled gracefully.
Since it should be possible to remap blocks (either because of (b) or corruption (d)), it's common to have a logical block remapping layer somewhere in the (disk) controller. At that point, classical filesystem algorithms and data structures can easily solve the remaining problems (consistency with journaling, and block-optimized access like on rotating disks).
Of course, there is another approach for solving (b): see http://en.wikipedia.org/wiki/Log-structured_file_system
(Actually, the log-structured filesystem technique can be used to implement the block remapping within the SSD compatibility layer that lies inside the disk controller itself, but that's a detail.)
Or copy-on-write filesystems like ZFS/btrfs etc. (although AFAIK they rely on a disk interface that is compatible with a rotating disk).
That is to say, solid-state storage is not really just persistent RAM where you can apply the same data structures and procedures you would use in core.
Future technologies might be different and trim down these differences, so that you could really have huge amounts of super-fast, word-addressable random access memory where nothing is lost even on power failure. But even assuming that is possible, it's likely that a later technology with even better storage density/cost will reintroduce some of the complications. So it's understandable why you want to keep some abstractions: you can easily replace the underlying technologies and still get most of the benefits.
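The remapping layer and the log-structured trick for (b) can be sketched together as a toy flash translation layer (an illustration of the technique only, not any real controller's design):

```python
class LogStructuredFTL:
    """Toy FTL: every write of a logical block is appended at the next
    free physical block and the map is updated, so no physical block is
    ever overwritten in place (and thus never needs an inline erase)."""
    def __init__(self, num_blocks):
        self.physical = [None] * num_blocks  # None = erased/free
        self.mapping = {}                    # logical index -> physical index
        self.head = 0                        # next append position in the log

    def write(self, logical, data):
        if self.head >= len(self.physical):
            raise RuntimeError("log full: garbage collection needed")
        old = self.mapping.get(logical)
        self.physical[self.head] = data      # append-only, like a log
        self.mapping[logical] = self.head
        self.head += 1
        if old is not None:
            self.physical[old] = None        # stale copy, reclaimable by GC

    def read(self, logical):
        return self.physical[self.mapping[logical]]

ftl = LogStructuredFTL(8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")         # the rewrite lands on a NEW physical block
assert ftl.read(0) == b"v2"
assert ftl.mapping[0] == 1  # logical block 0 now lives at physical block 1
```

The missing piece (deliberately) is garbage collection: compacting live blocks and erasing the reclaimed ones in the background, which is where the wear-leveling policy lives.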
OTOH the 1 and 2 TB drives from HGST were absolutely rock-solid from the start; we only lost a handful among several thousand installed. By contrast, in the past years I've had a steady 50% failure rate on the Barracuda ES.2, the shittiest drive on earth since the legendary 9 GB Micropolis.
Whatever brand ships with HP netbooks is basically a failure waiting to happen (we bought 100 HP netbooks with 3G from Verizon - they are very sick of us calling).
It used to be that Maxtor (when they were around and independent) had the most reliable drives. Now, who knows...
Thing is, all drive manufacturers have bad batches, and you might end up with 8 dead drives out of 10 from any manufacturer at any given time.
The only thing you can do is try to get drives from different batches, to avoid having all your eggs in one basket (what good is RAID 6 if all the drives die?).
Also, when a lot of drives die you should look for other culprits as well: is the temperature OK? Is the power stable? (Vibrations?)
There aren't many statistics available (that I know of), but a French store published their return rates for drives (which of course don't count drives returned directly to the manufacturer).
Or a write-up in Swedish about it (probably easier to read if you don't know either language; lots of graphs): http://www.sweclockers.com/nyhet/13859-fransk-datorsajt-publ...
But all in all, if you have many drives failing, you've either hit a bad batch or are treating your drives badly, and there's nothing you can do about it (the former, at least).
Also remember the demo from a Sun guy that showed how screaming at your RAID array had a significant, real-time impact on I/O (it was actually a DTrace demo): https://www.youtube.com/watch?v=tDacjrSCeq4
Someone (in the storage industry) also told me of some crap RAID enclosure that had normal performance until you added a drive in the center slot; then some bad vibration resonance kicked in and the performance dropped terribly :)
Right... that's my point. I had a bad experience with WD once upon a time (multiple drives, all different batches). And when I had to build a ~1TB RAID in 2001, all of my research pointed to Maxtor. Side note: That actually turned out to be a good choice. The drives lasted 5+ years and didn't die until a fan gave out in the drive cage. Poor little guys got too hot and seized up. After all of that, I was still able to pull 99.5% of the data off of them.
What I was trying to get at is that you can't look at any specific case and make blanket statements about the quality of drives. Ask 5 people what the best and worst drives are and you'll get 5 different answers. We all have horror stories. The only way to know for sure is to have actual population-level data on reliability. Unfortunately, that data is somewhat hard to come by, so we all just rely on anecdotes.
It also depends on whether we are talking about consumer drives, enterprise drives, etc. I doubt that return rates would actually cover serious data errors which occurred after the return window.
Honestly, the only people who could offer any insight are the large companies with huge datacenters: Amazon, Google, Yahoo, Rackspace, etc., and I doubt you'll hear them talking. If you could, I suspect the answer would be that it really doesn't matter which manufacturer you choose. All of them fail. All of them have bad batches. The best you can do is try to maximize the MTBF and gracefully replace failed drives as soon as possible.
Yup, this gives you good insight into the current crop. We've got about 15,000 drives in our Santa Clara data center at Blekko (mostly enterprise SATA, 2TB (WD), but some 1TB (Seagate) too). In populations that large I prefer to keep one drive family; I know folks decry a monoculture, but it keeps the failure rate more consistent across all drives, which helps with managing replacements.
Google did some research on this with their datacenters. There is apparently no correlation between heat and disk failures.
... and before that begins, the data is added to a giant ring buffer that records the last 30 TB of new writes, on machines with UPS. If a rack power failure happens, the ring buffer is kept until storage audits complete.
I think every company that deals with large data they can't lose develops appropriate paranoia.
The behavior of hard drives is like decaying atoms. You can't make accurate predictions about what any one of them will do. Only in aggregate can you say something like "the half-life of this pile of hardware is 12 years" or "if we write this data N times, we can reasonably expect to read it again."
ACM digital library link - http://dl.acm.org/citation.cfm?id=2056434
The paper - http://research.cs.wisc.edu/adsl/Publications/cce-dsn11.pdf
File systems with end-to-end checking are good. (These can turn into accidental memory tests for your host, too, and with a large enough population you'll see interesting failures).
OS X does a fake fsync() if you call it with the defaults.
fsync in Mac OS X: Since in Mac OS X the fsync command does not guarantee that bytes are written to the disk, SQLite sends a F_FULLFSYNC request to the kernel to ensure that the bytes are actually written through to the drive platter. This causes the kernel to flush all buffers to the drives and causes the drives to flush their track caches. Without this, there is a significantly large window of time within which data will reside in volatile memory, and in the event of system failure you risk data corruption.
So in summary, I believe that the comments in the MySQL news posting are slightly confused. On MacOS X fsync() behaves the same as it does on all Unices. That's not good enough if you really care about data integrity, and so we also provide the F_FULLFSYNC fcntl. As far as I know, MacOS X is the only OS to provide this feature for apps that need to truly guarantee their data is on disk.
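A small sketch of using that fcntl from Python (F_FULLFSYNC is the Mac OS X command described above; the fallback to plain fsync() on platforms that lack it, and the file name, are assumptions for the example):

```python
import fcntl
import os
import tempfile

def full_sync(fd):
    """Flush data all the way to the drive, not just to the OS buffers.
    On Mac OS X, F_FULLFSYNC also asks the drive to flush its write
    cache; elsewhere we fall back to fsync(), a weaker guarantee."""
    cmd = getattr(fcntl, "F_FULLFSYNC", None)  # only defined on some platforms
    if cmd is not None:
        fcntl.fcntl(fd, cmd)  # flush kernel buffers AND the drive cache
    else:
        os.fsync(fd)          # data reaches the drive, maybe only its cache

# Hypothetical journal write: durability matters for the commit record.
path = os.path.join(tempfile.mkdtemp(), "journal.bin")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"commit record")
full_sync(fd)  # only now should the record be considered durable
os.close(fd)
```

This is exactly the pattern the SQLite note describes: an ordinary write followed by the strongest flush the platform offers.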