
Disks lie. And the controllers that run them are partners in crime. - pwg
http://queue.acm.org/detail.cfm?id=2367378
======
ChuckMcM
During my tenure at NetApp I got to see all sorts of really really interesting
disk problems and the lengths software had to go to reliably store data on
them. Two scars burned into me from that were 1) disks suck, and 2) disks are
not 'storage'.

The first part is pretty easy to understand: storage manufacturers have
competed for years in a commodity market where consumers often choose price
per gigabyte over URER (the unrecoverable read error rate), and at the scale
of the market, savings of a few cents add up to better margins. And while the
'enterprise' Fibre Channel and SCSI drives could (and did) command a hefty
premium, the shift to SATA drives really put a crimp in the overall margin
picture. So the disk manufacturers are stuck between the reliability of the
drive and the cost of the drive. They surf the curve where it is just reliable
enough to keep people buying drives.

This trend bites back, making the likelihood of an error while reading more
and more probable. Not to pick on Seagate here, they are all similar, but
let's look at their Barracuda drive's spec sheet [1]. You will notice a
parameter 'Nonrecoverable Read Errors per Bits Read', and you'll see that it's
1 in 10^14 across the sheet, from 3TB down to 250GB. It is a statistical
thing; the whole magnetic-field-domain-to-digital-bit pipeline is one giant
analog loop of error extraction. 1 in 10^14 _bits_. So what does that mean?
Let's say each byte is encoded in 10 bits on the disk. Three trillion bytes is
30 trillion bits in that case, or 3x10^13 bits. Best case, if you read that
disk from one end to the other (like you were re-silvering a mirror from a
RAID 10 setup) you have roughly a 1 in 3 chance that one of those sectors
won't be readable _for a perfectly working disk._ Amazing, isn't it? Trying to
reconstruct a RAID 5 group with 4 disks remaining out of 5? Look out.
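
To put rough numbers on that back-of-the-envelope (same assumptions as above:
1 error per 10^14 bits read, 10 bits per byte on the platter), a quick sketch:

    # Rough odds of hitting at least one unrecoverable read error while
    # reading an entire 3 TB drive end to end. Assumes the spec-sheet rate
    # of 1 error per 10^14 bits read and ~10 bits per byte on the platter.
    import math

    urer = 1.0 / 1e14                  # unrecoverable errors per bit read
    bits_read = 3e12 * 10              # 3 TB of data, ~3x10^13 bits on the platter

    expected_errors = bits_read * urer                # 0.3
    p_at_least_one = 1 - math.exp(-expected_errors)   # ~0.26 (Poisson approximation)

    print(expected_errors, p_at_least_one)            # roughly a 1-in-3 to 1-in-4 chance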

So physics is not on your side, but "we've got RAID 6, Chuck!" Of course you
do, and that is a good thing. But what about when you write to the disk and
the disk replies "Yeah sure boss! Got that on the platters now", but it didn't
actually write it? Now you've got a parity failure sitting on the drive,
waiting to bite you. Or worse, it _does_ write the sector but writes it
somewhere else (saw this a number of times at NetApp as well). There are
_three million lines of code_ in the disk firmware. One of the manufacturers
(my memory says Maxtor) showed a demo where they booted Linux on the
controller of the drive.

Bottom line: the system mostly works, which is a huge win, and a lot of
people blame the OS when the drive is at fault, another bonus for the
manufacturers. But unless your data is in a RAID 6 group or on at least 3
drives, it's not really 'safe' in the "I am absolutely positive I could read
this back" sense of the word.

[1] http://www.seagate.com/files/staticfiles/docs/pdf/datasheet/disc/barracuda-ds1737-1-1111us.pdf

~~~
biot
What were the least problematic drives in your experience?

~~~
ChuckMcM
The ones that had been shipping for a few years. Like many things, they all
start life at their least stable, go through several firmware revisions,
maybe a couple of hardware revisions, and then they all hit about the same
level of reliability if they aren't changed further.

~~~
wazoox
Yup, the last time I got bitten was with "brand new, fresh from the factory,
new model" 3 TB drives. The crappy firmware failed badly under load and I
nearly lost 60 TB. Recent firmware revisions pose no problem, though.

OTOH, the 1 and 2 TB drives from HGST were absolutely rock-solid from the
start. Only lost a handful of them among several thousand installed. By
contrast, in past years I've had a steady 50% failure rate on the Barracuda
ES.2, the shittiest drive on earth since the legendary 9 GB Micropolis.

~~~
InclinedPlane
I had so much trouble with those Seagates that it put me off ever buying from
them again.

------
rarrrrrr
At SpiderOak, every storage cluster writes new customer data to three RAID 6
volumes on 3 different machines, and waits for fdatasync on each, before we
consider it written.

... and before that begins, the data is added to a giant ring buffer that
records the last 30 TB of new writes, on machines with UPS. If a rack power
failure happens, the ring buffer is kept until storage audits complete.

I think every company that deals with large data they can't lose develops
appropriate paranoia.

The behavior of hard drives is like that of decaying atoms. You can't make
accurate predictions about what any one of them will do. Only in aggregate can
you say something like "the half life of this pile of hardware is 12 years" or
"if we write this data N times we can reasonably expect to read it again."
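
The mechanical half of "wait for fdatasync before we consider it written" is
simple to show; here's a minimal single-file sketch (not our production code,
and it ignores the three-machine fan-out and directory syncing):

    import os

    def durable_write(path, data):
        # Don't acknowledge the write until fdatasync() has returned; until
        # then the bytes may only exist in the page cache and can vanish on
        # power loss.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, data)
            os.fdatasync(fd)   # flush the file data down to the device
        finally:
            os.close(fd)

    # Hypothetical usage: report success only once every replica has synced.
    # for replica_path in replica_paths:
    #     durable_write(replica_path, blob)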

~~~
baruch
Interesting, do you do disk scrubbing to make sure the data is still out
there after a while? That would be the key method of ensuring you can still
recover the data when you need it.

------
scdeepak
My colleagues and I worked on a solution to this problem a few years back.
The results were published and are linked here.

ACM digital library link - <http://dl.acm.org/citation.cfm?id=2056434>

The paper - <http://research.cs.wisc.edu/adsl/Publications/cce-dsn11.pdf>

~~~
baruch
That's a really neat hack, but how many disks have you seen that do not adhere
to their write-cache-disabled setting?

------
jat850
I wish I could post the video from the talk, but here are some slides at
least. Good supporting talk on the pitfalls of making assumptions about how
disks work, from OSCON '11:

http://www.slideshare.net/iammutex/what-every-data-programmer-needs-to-know-about-disks

------
kabdib
I had fun once diagnosing a bad bit in a drive's cache memory. Writes would go
out, and _sometimes_ come back with a bit set. All the disk-resident CRCs in
the world won't help if your data is mangled _before_ it makes it to the
media.

File systems with end-to-end checking are good. (These can turn into
accidental memory tests for your host, too, and with a large enough population
you'll see interesting failures).
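
A crude application-level version of the same idea (a sketch, not any
particular filesystem's on-disk format): store a checksum with the data and
verify it on every read, so corruption anywhere along the path, flaky cache
RAM included, gets caught.

    import hashlib, os

    def write_checked(path, data):
        # Prepend a SHA-256 of the payload; any later mangling of the data,
        # wherever it happens, shows up as a checksum mismatch on read.
        digest = hashlib.sha256(data).digest()
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, digest + data)
            os.fsync(fd)
        finally:
            os.close(fd)

    def read_checked(path):
        with open(path, "rb") as f:
            blob = f.read()
        digest, data = blob[:32], blob[32:]
        if hashlib.sha256(data).digest() != digest:
            raise IOError("checksum mismatch: data was corrupted en route")
        return data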

~~~
baruch
At the storage level there is T10 DIF, which is supposed to help with such
things (among many others), though it is used in a very limited fashion and is
only supported on higher-end disks.

------
recoiledsnake
The article glosses over the fact that fsync() itself has major issues. For
example, on ext3, if you call fsync() on a single file, _all_ cached file
system data is written to disk, leading to an extreme slowdown. This led to a
Firefox bug, where the SQLite db used for the awesome bar and for bookmarks
called fsync() and slowed everything down.

http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/

OS X does a fake fsync() if you call it with the defaults.
https://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/CoreData/Articles/cdPersistentStores.html

fsync in Mac OS X: Since in Mac OS X the fsync command does not make the
guarantee that bytes are written, SQLite sends a F_FULLFSYNC request to the
kernel to ensure that the bytes are actually written through to the drive
platter. This causes the kernel to flush all buffers to the drives and causes
the drives to flush their track caches. Without this, there is a significantly
large window of time within which data will reside in volatile memory, and in
the event of system failure you risk data corruption.
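
For what it's worth, that stronger flush is requested through fcntl(); here is
a sketch of how an application might ask for it (the raw value 51 is Darwin's
F_FULLFSYNC, used only as a fallback in case the Python build doesn't expose
the constant):

    import fcntl, os

    def full_fsync(fd):
        # On Mac OS X, plain fsync() only pushes data to the drive, which may
        # still hold it in its write cache; F_FULLFSYNC asks the drive to
        # flush that cache as well.
        F_FULLFSYNC = getattr(fcntl, "F_FULLFSYNC", 51)   # 51 = Darwin's value
        try:
            fcntl.fcntl(fd, F_FULLFSYNC)
        except OSError:
            os.fsync(fd)   # some filesystems reject F_FULLFSYNC; fall back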

~~~
btrask

      So in summary, I believe that the comments in the MySQL news posting
      are slightly confused.  On MacOS X fsync() behaves the same as it does
      on all Unices.  That's not good enough if you really care about data
      integrity and so we also provide the F_FULLFSYNC fcntl.  As far as I
      know, MacOS X is the only OS to provide this feature for apps that
      need to truly guarantee their data is on disk.
    

The full post goes into detail:
http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html

