On the Complexity of Crafting Crash-Consistent Applications (2014) [pdf] (usenix.org)
36 points by pcr910303 8 days ago | 7 comments

"Crash-consistent applications", not "Crashing applications". This is a great paper which illustrates how a developer's mental model of what a file system does may differ from the reality.

The submitted title was "File Systems Are Not Created Equal: Complexity of Crafting Crashing Applications [pdf]", which was a valiant attempt at squeezing the whole thing into 80 chars, but I don't think that's doable.

I was once responsible for designing a mission-critical embedded application that was guaranteed to withstand a power cycle or crash at any time and continue from the point where it was interrupted.

It wasn't very hard, but it required rigorous design. The important pieces were state machines whose state was committed to log-based storage at significant points in operation.
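The idea of committing state-machine checkpoints to a log and replaying them on startup can be sketched roughly like this (all names and the JSON-lines record format are my own illustrative assumptions, not from the original system):

```python
# Sketch: persist state-machine checkpoints to an append-only log and
# recover the last complete one on startup. LOG_PATH and the record
# format are illustrative assumptions.
import json
import os

LOG_PATH = "state.log"

def commit(state: dict) -> None:
    """Append a checkpoint record and force it to stable storage."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(state) + "\n")
        f.flush()
        os.fsync(f.fileno())  # durability across power loss

def recover(default: dict) -> dict:
    """Return the last complete checkpoint, or the default state."""
    if not os.path.exists(LOG_PATH):
        return default
    last = default
    with open(LOG_PATH) as f:
        for line in f:
            try:
                last = json.loads(line)
            except ValueError:
                break  # torn write at the tail: stop at last good record
    return last
```

On boot the application calls `recover()` and resumes from the returned checkpoint; a torn write at the tail of the log is simply ignored.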

Then there is the detection and handling of repeating crashes, which is where it really gets interesting.
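One common way to detect a crash loop (a sketch under my own assumptions, not the commenter's actual mechanism) is to persist a consecutive-restart counter that is cleared only after the application has run healthily for a while:

```python
# Sketch of crash-loop detection: count consecutive restarts in a small
# file; the caller falls back to a safe mode if the count gets too high.
# COUNTER_PATH and MAX_RESTARTS are illustrative assumptions.
import os

COUNTER_PATH = "restart.count"
MAX_RESTARTS = 3

def record_boot() -> int:
    """Increment and persist the consecutive-restart counter; return it."""
    count = 0
    if os.path.exists(COUNTER_PATH):
        try:
            with open(COUNTER_PATH) as f:
                count = int(f.read())
        except ValueError:
            count = 0  # corrupt counter file: start over
    count += 1
    with open(COUNTER_PATH, "w") as f:
        f.write(str(count))
        f.flush()
        os.fsync(f.fileno())
    return count

def mark_healthy() -> None:
    """Reset the counter once the application has run normally for a while."""
    if os.path.exists(COUNTER_PATH):
        os.remove(COUNTER_PATH)
```

At startup, if `record_boot()` exceeds `MAX_RESTARTS`, the application can boot into a degraded safe mode instead of repeating the crash.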

It is not very well known, but the Apollo Guidance Computer had this same capability! The famous program alarms (1202 and 1201) during the Apollo 11 descent were caused by low-priority jobs stealing processing power from the high-priority ones. The operating system was designed to simply restart the computer in those cases (which killed the low-priority jobs) and continue execution from where it left off:

> The software rebooted and reinitialized the computer, and then restarted selected programs at a point in their execution flow near where they had been when the restart occurred.

See [0] for more details.

[0] https://www.hq.nasa.gov/alsj/a11/a11.1201-pa.html

I began design and development by researching other projects with similar requirements, and I remember the AGC was one of them.

The difference in my case was that I did not have simple non-volatile memory like the AGC had; I had a flash chip with no wear leveling. To protect the flash from premature death, I had to design my own database that stored data as deltas appended to a file. On startup it would scan from the beginning of the file until the last complete record and start from there. When the data in the file reached a preset size, all live records would simply be rewritten to the beginning of another file.
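The delta log described above might look something like this sketch (the record layout, checksum choice, and size limit are my own assumptions; the real system presumably used a more compact binary format tuned to its flash geometry):

```python
# Sketch: append-only delta store with fixed-size checksummed records.
# replay() scans from the start and stops at the first incomplete or
# corrupt record; maybe_compact() rewrites only live records to a new
# file once the log grows past a limit. All names/sizes are illustrative.
import os
import struct
import zlib

REC_FMT = "<II"                             # key, value
REC_SIZE = struct.calcsize(REC_FMT) + 4     # payload + CRC32
MAX_LOG_BYTES = 4096

def append_delta(path: str, key: int, value: int) -> None:
    payload = struct.pack(REC_FMT, key, value)
    crc = struct.pack("<I", zlib.crc32(payload))
    with open(path, "ab") as f:
        f.write(payload + crc)
        f.flush()
        os.fsync(f.fileno())

def replay(path: str) -> dict:
    """Scan from the beginning; last write wins; ignore a torn tail."""
    live = {}
    if not os.path.exists(path):
        return live
    with open(path, "rb") as f:
        data = f.read()
    for off in range(0, len(data) - REC_SIZE + 1, REC_SIZE):
        payload = data[off:off + REC_SIZE - 4]
        (crc,) = struct.unpack("<I", data[off + REC_SIZE - 4:off + REC_SIZE])
        if zlib.crc32(payload) != crc:
            break  # incomplete/corrupt record: stop here
        key, value = struct.unpack(REC_FMT, payload)
        live[key] = value
    return live

def maybe_compact(path: str, new_path: str) -> str:
    """Rewrite only live records into a fresh file when the log is too big."""
    if os.path.getsize(path) < MAX_LOG_BYTES:
        return path
    for key, value in replay(path).items():
        append_delta(new_path, key, value)
    os.remove(path)
    return new_path
```

Appending deltas instead of rewriting in place spreads writes across the file, which helps with flash wear even without a proper wear-leveling layer, and the checksum lets recovery stop cleanly at a record torn by a power cut.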

It says 2014. Has anything been done with, say, Git or SQLite since then to make them more robust against filesystem fragility?

I'm fairly confident that SQLite deals well with filesystem crashes nowadays; see https://www.sqlite.org/testing.html and https://www.sqlite.org/atomiccommit.html
