
The space between theory and practice in distributed systems - bbgm
http://brooker.co.za/blog/2014/08/10/the-space-between.html
======
mcguire
Yep, that's been a problem for decades, not just in distributed systems but
perhaps _especially_ in distributed systems. I'm practically overjoyed to see
things like "Formal methods at Amazon Web Services", because it means at least
some kind of bridge between the two exists.

My suspicion, setting aside johnparkerg's point about polarization, is that
distributed systems in practice are particularly messy from a formal
standpoint, in contrast to non-distributed systems. In practice you can live
with the messiness if you can reduce it to a sufficient number of 9's, but
that kind of probabilistic guarantee seems to be anathema to theoretical
approaches.

For example, the proof of the CAP theorem is irrelevant in practice,
specifically because no system I'm aware of makes the strong consistency
assumptions that it requires. On the other hand, CAP _behavior_ is definitely
a problem, once you reach a certain size.

~~~
api
There is a parallel divide in AI: "neats" vs. "scruffies". Neats like systems
that are provably correct and formal, while scruffies do things like
superposing multiple algorithms with fudge factors in between, tuned by a
genetic algorithm. Formalists run away screaming, but it works "sometimes",
where "sometimes" happens to be the problem domain they're working in.

"But it's not mathematically stable!"

"It has been for the last month..."

------
chubot
_If FLP says consensus is impossible with one faulty process, and faults
happen all the time in practice, how are real systems built with consensus?_

Good question. I think the answer is that you don't build large systems using
consensus; you bootstrap large systems with very small systems using
consensus. The very small systems are reasonably assumed to have no Byzantine
faults (i.e. just like you more or less rely on a single database server not
to have faults).

All the systems I know of that use consensus are meant to be small, e.g. run
on 5 machines or so (and of course the membership is fixed). Google's Chubby,
Yahoo's ZooKeeper, and similar systems like doozer and etcd all work like
this.

Consensus doesn't "scale" anyway (the latency isn't bearable). If you only
have 5 machines, the likelihood of Byzantine faults over a reasonably long
time period is low. The main problem you will see is your own software bugs
(i.e. not bugs due to faulty CPU, memory, disks, switches, etc.).
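The pattern described above - a large fleet leaning on a tiny fixed-membership
consensus cluster - might be sketched like this. This is a hypothetical
illustration: `ConsensusKV` and `try_acquire_leadership` are made-up names,
and the store is faked with an in-process dict standing in for a real
Chubby/ZooKeeper/etcd client:

```python
class ConsensusKV:
    """Stand-in for the small consensus-backed store (the role Chubby,
    ZooKeeper, or etcd plays). Only that 5-node ensemble runs an actual
    consensus protocol; here it is faked with a plain dict."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def compare_and_set(self, key, expected, new):
        # Atomic on a real store because the ensemble agrees,
        # via consensus, on the order of writes.
        if self._data.get(key) == expected:
            self._data[key] = new
            return True
        return False

def try_acquire_leadership(store, me, ttl, now):
    """Called by any of the thousands of workers in the large system.
    The workers never run consensus among themselves; they just race
    on one compare-and-set against the small cluster."""
    current = store.get("leader")
    if current is not None and current[1] > now:
        return current[0] == me        # a live lease exists
    # Lease is absent or expired: try to claim it for `ttl` seconds.
    return store.compare_and_set("leader", current, (me, now + ttl))

store = ConsensusKV()
print(try_acquire_leadership(store, "worker-1", ttl=10, now=0))   # True
print(try_acquire_leadership(store, "worker-2", ttl=10, now=5))   # False: live lease
print(try_acquire_leadership(store, "worker-2", ttl=10, now=20))  # True: lease expired
```

The point of the sketch is the shape: the expensive consensus rounds happen
only inside the tiny cluster, while the big system pays one cheap
read/compare-and-set per lease renewal.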

~~~
dgacmu
Apparently Marc is right about the need for more of this - the answer is that
the kinds of nasty message-loss timing that can render consensus impossible
don't, by and large, persist for very long in practice. Byzantine failures
aren't necessary to prevent consensus; the impossibility is a consequence of
the arbitrary network behavior allowed in an asynchronous communication
model.

I should add this question to one of my exams and see if my students get it
right. I'll be a little grumpy if they don't, but it's a great question. :)

~~~
chubot
What's an example of "timing of message losses" that would make consensus
impossible? I didn't think Paxos, say, depended on any such assumptions.

I suppose you can mathematically construct some kind of non-total message loss
such that Paxos would never make progress. But that kind of message loss won't
persist forever in a real system... it would basically be message loss with
knowledge of the algorithm?

~~~
grogers
Racing proposers cause paxos to fail to converge.

A: Prepare(1)

B: Prepare(2)

A: Accept(1, Va) # fails.

A: Prepare(3)

B: Accept(2, Vb) # fails.

Etc... In practice, randomized backoffs when this happens cause it to
converge extremely fast, but exactly the wrong situation COULD delay it
forever. Don't lose sleep over it, though.

~~~
chubot
Yeah, that is basically what I was imagining, but no failure condition in
reality matches that. You might get unlucky and fail to converge for 10
iterations, but not for, say, 10,000.

It is sort of Byzantine in that the "system" is inserting very specific
message losses with knowledge of the algorithm.

------
johnparkerg
I think this is just a reflection of the gap between academia and the real
world. One could argue that the problem is the lack of a 'trial and error'
philosophy in 'theory and practice' or that scientists couldn't possibly
develop so much if they were bothered with the real world.

I believe the true problem lies in our need to categorize people as either
scientists/theoretical engineers or down-to-earth engineers, which only
polarizes the spectrum.

------
vicaya
FLP and CAP merely illustrate the fundamental trade-off between
"safety/consistency" and "liveness/availability" in distributed systems.

There is a spectrum of design choices between strong consistency with
best-effort availability and best-effort everything, and many points on that
spectrum make sense for practical use cases.

------
erikb
I think the middle ground between theory and practice has gaps in most areas
of expertise. The problem may be that to create good content in those gaps,
one needs to have reached a certain level of skill in both areas, theory and
practice. Very few people, though, actually step aside from their daily work
(which is usually pure theory or pure practice) to achieve something on the
other end.

------
api
I would add a ninth fallacy to the eight: CPU is free. This is effectively
true on desktops, and possibly on servers if task priorities are used, but on
mobile devices (including laptops) CPU eats battery life.

