
Recovery-Oriented Computing - llambda
http://roc.cs.berkeley.edu/roc_overview.html
======
quadhome
Recovery-Oriented Computing has hit HN before. Instead of surfing the overview
page, check out the toplevel with links to a LOT of fantastic and accessible
papers and presentations:

<http://roc.cs.berkeley.edu/>

Then, move on to James Hamilton[1]'s "On Designing and Deploying Internet-
Scale Services", a ten-minute read that's a HOWTO on putting Recovery-
Oriented Computing into practice:

[http://static.usenix.org/event/lisa07/tech/full_papers/hamil...](http://static.usenix.org/event/lisa07/tech/full_papers/hamilton/hamilton_html/)

That paper changed the entire way I design and evolve systems. And I keep
seeing the lessons from it come up in less-distilled forms in other places.
For example, Zach Holman of GitHub's recent and great talk "How To Build A Github":

<http://zachholman.com/talk/how-to-build-a-github>

[1] Vice President and Distinguished Engineer on the Amazon Web Services team
where he is focused on infrastructure efficiency, reliability, and scaling.
<http://www.mvdirona.com/jrh/work/>

------
einhverfr
I have a forthcoming blog post about the development of double entry
accounting systems in the late Middle Ages and the way in which these are
recovery-oriented systems with an eye on error detection and recovery. Of
course these were on paper instead of digitally, but I think some of the same
perspectives apply.

Double entry accounting always balances because it takes the viewpoint of a
business with no intrinsic value, where debts owed to the business and assets
owned by the business (debits, lit. "thing owed") equal debts owed by the
business (credits, "thing entrusted to someone else", whether to creditors or
to stockholders). Because everything is recorded from the counterparty's
perspective and the business has no perspective of its own, the system is
robust. I further wonder whether this isn't a natural development from the
medieval tally sticks used to track debt.
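The balancing invariant described above can be sketched in code. This is a minimal, hypothetical journal (the `Journal` class and its account names are illustrative, not any real accounting system): every entry must post equal debits and credits, so unbalanced entries are rejected at the door and the trial balance is zero by construction.

```python
from collections import defaultdict

class Journal:
    """Toy double-entry journal: entries balance or are rejected."""

    def __init__(self):
        self.accounts = defaultdict(int)  # account name -> balance in cents

    def post(self, debits, credits):
        """debits/credits: lists of (account, amount) pairs."""
        if sum(a for _, a in debits) != sum(a for _, a in credits):
            raise ValueError("unbalanced entry rejected")  # error detection
        for acct, amount in debits:
            self.accounts[acct] += amount
        for acct, amount in credits:
            self.accounts[acct] -= amount

    def trial_balance(self):
        # Sum over all accounts is zero iff every posted entry balanced.
        return sum(self.accounts.values())

j = Journal()
j.post(debits=[("cash", 10_000)], credits=[("owner_equity", 10_000)])
j.post(debits=[("inventory", 4_000)], credits=[("cash", 4_000)])
```

The recovery-oriented part is that errors are detectable locally (a single unbalanced entry) and globally (a nonzero trial balance), without any single "true" ledger being trusted.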

But one thing that occurred to me is that it is hard to engineer this sort of
system. Double entry accounting systems were probably an organic development
based on the opening up of literacy outside the Church in the later Middle
Ages. Pacioli only wrote about a system already in use. He did not invent it.
It seems to me that over the next hundred years or so we can expect our
approaches to computing to evolve in a similar manner. The big challenge is
that, unlike building on tally sticks, we don't have clear examples of how to
do this with the computing problems of today.

~~~
chubot
Sounds cool, post it on HN when you're done.

Along the same lines, Pat Helland's paper "Building on Quicksand" says we
should take inspiration from the paperwork used in business processes (like
carbon copies, etc.):

[http://masterthefundamentals.rstata.org/2011/10/10/building-...](http://masterthefundamentals.rstata.org/2011/10/10/building-on-quicksand/)

Real life doesn't have distributed transactions! I'm sort of convinced that in
10 years, distributed transactions will be viewed as naive in the way that RPC
is (or is starting to be). Both are trying to make distributed computing look
like single-machine computing. But they're just inherently different.

Oh, also there is "Starbucks doesn't use 2-phase commit":
<http://www.eaipatterns.com/ramblings/18_starbucks.html>

EDIT: Regarding your last point, I don't think it's inherently difficult to
come up with these kinds of systems. Programmers just have to let go of the
idea that there's a single truth and they control it all. That the program
knows all. They have to have "empathy" with individual actors in the system.
Each actor is dealing with participants it doesn't necessarily trust. The
programmer has to put himself in the position of each actor when writing that
code.

~~~
einhverfr
By engineering, of course, I mean engineering from scratch.

On that last point though (splitting this off), I think there are two
difficulties in engineering it. First, these are fundamentally more complex
systems: the more complex a system is, the harder it is to spot all points of
failure. Second, more complexity fundamentally means that the failure cases
themselves are more complex.

Instead I think that the way most of these systems develop is through
progressive, evolutionary invention. Ok, I have tally sticks. I can write
about them in a journal and use arithmetic to track the totals. I will write
the stock on the left, referring to the thing owed (Latin "Debit") and the
smaller foil on the right, referring to what is loaned to me (Latin "Credit").

Ok, now I need to categorize these. So I will start putting a code by each
one. Now I have a chart of accounts! Now I need to total each of my accounts
so I transcribe into another book and now I have a ledger. Do the books
balance? Let me add them up and find out. Let's call this a trial balance. And
so forth. The utility arises largely from a large number of different people
working with real issues to find other uses for the system.

We see this with the evolution of SQL over time as well. SQL's outlines may
not have changed since it was first developed, but the standards have evolved
a lot to meet real needs.

So the challenge is with something like a disconnected POS. Again, a system
modeled on the real-world invoice and cash-voucher system might work best.

------
hewitt
While I was going over some of their goals I immediately started thinking
about Erlang.

* Isolation and Redundancy: Processes are isolated. If some component crashes, it won't affect the rest of the system.

* Design for high modularity, measurability, and restartability: You write Erlang/OTP apps which you then combine (modularity). You can easily get a live shell into any Erlang system, so you can measure various parameters (measurability). Using Erlang/OTP libraries you implement supervisors such that when a component of your system crashes you can just restart it without interfering with the rest of the system (restartability).

I'm not even close to being proficient in Erlang, but weren't these some of
the issues Erlang was designed to solve in the first place?
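The restartability point above can be illustrated outside Erlang, too. Here is a single-threaded Python analogue of an OTP-style one-for-one supervisor (the `supervise` function is a made-up toy, nothing like real OTP): when a child crashes, only that child is restarted, up to a restart limit, and its siblings are untouched.

```python
def supervise(children, max_restarts=3):
    """children: dict of name -> zero-arg callable. Restart crashed
    children individually (one-for-one), up to max_restarts each."""
    restarts = {name: 0 for name in children}
    results = {}
    for name, run in children.items():
        while True:
            try:
                results[name] = run()
                break
            except Exception:
                restarts[name] += 1
                if restarts[name] > max_restarts:
                    results[name] = "gave up"  # supervisor stops retrying
                    break
    return results, restarts

# A child that crashes twice, then succeeds; a stable sibling.
attempts = {"count": 0}
def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("crash")
    return "ok"

results, restarts = supervise({"flaky": flaky, "stable": lambda: "ok"})
```

Real OTP supervisors add restart-intensity windows, shutdown ordering, and supervision trees; the sketch only shows the core idea that recovery (restart) is a first-class, localized operation.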

~~~
jerf
Yes, the key to grokking Erlang is understanding that all of the good design
decisions come from reliability, not concurrency. Concurrency is to serve
reliability, not the other way around.

------
ww520
Similar to crash-only computing: design your app to be crash-ready, with
automatic recovery at the next startup.

Undo is nice and all, but some operations cannot be undone, e.g. emails
already sent. Compensating transactions are as important as rollback
transactions.
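A minimal sketch of that distinction, using email as the irreversible side effect (`send_email`, `process_order`, and the outbox are hypothetical stand-ins, not a real mail API): since a sent email can't be rolled back, the only option is a compensating action, here a correction message.

```python
outbox = []  # stand-in for a mail server: once appended, it's "delivered"

def send_email(to, body):
    outbox.append((to, body))  # irreversible side effect

def process_order(order):
    send_email(order["customer"], "Your order shipped.")
    if not order.get("in_stock", True):
        # Too late to undo the first email; compensate instead.
        send_email(order["customer"], "Correction: your order is delayed.")

process_order({"customer": "a@example.com", "in_stock": False})
```

This is the same pattern the "Starbucks doesn't use 2-phase commit" piece describes: accept that the action happened, then issue a corrective action, rather than pretending the world supports rollback.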

~~~
3amOpsGuy
Yeah, I have to agree. Crash-oriented or crash-only design is, to my mind,
more useful than 'only' recovery-oriented: it's also realistically achievable.
I've taken to implementing my middleware this way since I started working
against Cassandra back-ends (also a crash-only design).
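The crash-only idea in this subthread can be sketched briefly. This is a hypothetical task queue (the log path and record shapes are invented for illustration): every task is appended to a durable log before it runs, so startup after a crash is the same code path as normal startup, i.e. rescan the log and resume whatever isn't marked done.

```python
import json
import os
import tempfile

# Durable append-only log; recovery = re-reading it at startup.
LOG = os.path.join(tempfile.gettempdir(), "crashonly_tasks.log")

def enqueue(record):
    with open(LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # survive a crash right after this call

def recover_pending():
    """Same code path for cold start and crash recovery."""
    if not os.path.exists(LOG):
        return []
    with open(LOG) as f:
        entries = [json.loads(line) for line in f]
    done_ids = {e["id"] for e in entries if e.get("done")}
    return [e for e in entries
            if not e.get("done") and e["id"] not in done_ids]

# Demo: two tasks enqueued, one completed, then a "restart".
if os.path.exists(LOG):
    os.remove(LOG)  # fresh demo log
enqueue({"id": 1, "kind": "send_invoice"})
enqueue({"id": 2, "kind": "charge_card"})
enqueue({"id": 1, "done": True})  # completion marker for task 1
pending = recover_pending()
```

Because there is no separate "clean shutdown" path, a crash costs nothing extra: the next startup simply finds task 2 still pending and runs it.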

