
On the Complexity of Crafting Crash-Consistent Applications (2014) [pdf] - pcr910303
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
======
dfryer
"Crash-consistent applications", not "Crashing applications". This is a great
paper which illustrates how a developer's mental model of what a file system
does may differ from reality.

~~~
dang
The submitted title was "File Systems Are Not Created Equal: Complexity of
Crafting Crashing Applications [pdf]", which was a valiant attempt at
squeezing the whole thing into 80 chars, but I don't think that's doable.

------
lmilcin
I was once responsible for designing a mission-critical embedded application
that was guaranteed to withstand a power cycle or crash at any time and
continue from the point where it was interrupted.

It wasn't very hard, but it required rigorous design. The important pieces
were state machines that were committed to log-based storage at significant
points in operation.

Then there is detection and handling of repeating crashes, which is where it
really gets interesting.
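One simple way to detect a crash loop (the thresholds and names here are my illustrative assumptions, not from the comment): persist the timestamp of every boot, and treat too many boots within a short window as a repeating crash.

```python
def is_crash_loop(boot_times, now, max_boots=3, window_s=60.0):
    """Return True when max_boots or more boots fall within the last
    window_s seconds -- a sign we keep restarting into the same crash.
    boot_times would be persisted across reboots; the thresholds are
    illustrative, not from the original design."""
    recent = [t for t in boot_times if now - t <= window_s]
    return len(recent) >= max_boots
```

On detection, the handler would typically skip the suspect work item or fall back to a safe state rather than resuming normally and crashing again.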

~~~
elteto
It is not very well known but the Apollo Guidance Computer had this same
capability! The famous program alarms (1202 and 1201) during the Apollo 11
descent were caused by low-priority jobs stealing processing power from the
high-priority ones. The operating system was designed to just restart the
computer in those cases (this would kill the low priority jobs) and continue
execution from where it left off:

> The software rebooted and reinitialized the computer, and then restarted
> selected programs at a point in their execution flow near where they had
> been when the restart occurred.

See [0] for more details.

[0]
[https://www.hq.nasa.gov/alsj/a11/a11.1201-pa.html](https://www.hq.nasa.gov/alsj/a11/a11.1201-pa.html)

~~~
lmilcin
I began design and development with research on other projects with similar
requirements, and the AGC was one of them, I remember.

The difference in my case was that I did not have simple non-volatile memory
like the AGC had; I had a flash chip with no wear leveling. To protect the
flash from premature death I had to design my own database, which stored data
in the form of deltas appended to a file. On startup it would scan from the
beginning of the file until the last complete record and start from there.
When the data in the file reached a preset size, all live records would
simply be rewritten to the beginning of another file.

------
ncmncm
Says 2014. Has anything been done with, say, Git or SQLite since to make them
more robust against filesystem fragility?

~~~
ric129
I'm fairly confident that SQLite deals well with filesystem crashes nowadays;
see [https://www.sqlite.org/testing.html](https://www.sqlite.org/testing.html)
and
[https://www.sqlite.org/atomiccommit.html](https://www.sqlite.org/atomiccommit.html)

