

Crash-tolerant data storage - Tomte
https://news.mit.edu/2015/crash-tolerant-data-storage-0824

======
kabdib
I can't remember the last time I lost data to a file system bug. I've
certainly lost data to applications running on top of file systems ("Source
Safe" -- when they put "Safe" in the product name, you _know_ you have to run
away...)

All the stuff I've lost has been due to:

\- Hardware failure (vanilla smoke-from-the-disk, or RAID controller failure,
or too many RAID volumes dying at once)

\- Fat fingers (rm -rf equivalents, or format the wrong drive)

\- Data is ancient and in a proprietary format, and the apps that can read it
won't run any more (at least, not without a great deal of effort)

\- Butterfingers (dropped disk on a concrete floor)

\- Media age (drive won't turn / oxide got tired, lost adhesion and rubbed
off), or treating media badly (a decade of alternating hot/cold in the garage
next to the spiders is not good)

\- Vanishing interface standard (no EISA slot, or nowadays, no parallel port),
or no drivers for modern OSes

I'm happy there's a provably correct file system out there.

But it ain't gonna save you :-)

~~~
mwcremer
"The hard drive is ON FIRE!"

~~~
kabdib
One fine day I came into work and overnight my Gateway workstation's boot
drive had developed 10,000 bad sectors (maybe more, who counts these things?)

Who indeed?

I called Gateway tech support and said, "Hi, overnight my hard drive got over
10,000 bad sectors and I'd like you to replace it, since it's under warranty."

Gateway support: (after going through some really tiresome support nonsense,
like power-cycling the machine and verifying a network connection of a
computer that can barely boot). "Sorry, we're not going to do that. Your hard
drive still has several hundred thousand perfectly good sectors, so it's
working fine, no replacement."

Me: (facepalm)

So eventually through some social engineering I got them to replace the drive
(a little play acting involving my manager pretending to yell at me, which got
instant sympathy from the Gateway support person. This is a useful tool).
Really I just went over to Fry's and bought a new hard drive. But it was the
principle of the thing.

And I resolved that the next time I had to call in a hard drive failure, I
would say:

"Hello, Gateway? My hard drive is ON FIRE and my computer won't boot any more.
What do you want me to do?"

~~~
A010
I remember the one I got years ago
[https://www.youtube.com/watch?v=-e9OcWIH6N8](https://www.youtube.com/watch?v=-e9OcWIH6N8)

------
rdtsc
That is interesting -- using Coq to prove crash safety and consistency. It is
a bit in the spirit of how Kyle (Aphyr)'s Knossos tool checks databases in his
"Call Me Maybe" series.

There should definitely be an equivalent thing for data corruption. There sure
have been databases and systems that made lofty marketing claims only to then
silently corrupt user data (really, one of the most terrible things that could
happen -- as it will corrupt your backups and might not be noticed for a
while).

Also, due to the proliferation of VMs/containers, the model checker would have
to include abnormal behaviors in both the host and the VM instance. Say, pull
the power on the host or the VM.

Looking at products that store data, I have both designed my own (so I know
exactly how and when data checkpoints are happening) and evaluated a few DBs.
I like CouchDB as an off-the-shelf example. It has an append-only, crash-only
design. Stopping can be done by kill -9 on the process, and backups can happen
anytime, e.g. take EBS snapshots anytime. You'll pay for it, perhaps, in raw
write performance, but it has worked great for me.
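
To make the append-only, crash-only idea concrete, here is a minimal sketch in Python. It is not CouchDB's actual on-disk format; the record layout (length + CRC32 + payload) and the names `append` and `recover` are assumptions made up for illustration. The point is only that if you never overwrite in place, a kill -9 can at worst leave a torn record at the tail, and startup recovery is the same code path as a normal start.

```python
import os
import struct
import zlib

# Hypothetical record layout: 4-byte payload length, 4-byte CRC32, then payload.
HEADER = struct.Struct("<II")


def append(path, payload):
    """Append one record and fsync it before returning."""
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    with open(path, "ab") as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())


def recover(path):
    """Return all intact records; cut off any torn tail left by a crash."""
    records = []
    if not os.path.exists(path):
        return records
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        # A short or checksum-failing record means the write never completed.
        if len(payload) < length or zlib.crc32(payload) != crc:
            break
        records.append(payload)
        pos = start + length
    if pos < len(data):
        # Drop the incomplete tail so later appends start at a clean boundary.
        with open(path, "r+b") as f:
            f.truncate(pos)
            f.flush()
            os.fsync(f.fileno())
    return records
```

Calling `recover` on every startup, clean or not, is what makes it "crash-only". The fsync per append is also where the raw write performance goes.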

