

Jepsen: Testing the Partition Tolerance of PostgreSQL, Redis, MongoDB and Riak - sethev
http://www.infoq.com/articles/jepsen

======
mjb
The exchange between Antirez and aphyr following the post about Redis sentinel
is a fascinating comparison between two engineering approaches. Antirez makes
a qualitative argument
([http://antirez.com/news/56](http://antirez.com/news/56), especially
[http://antirez.com/news/56#comment-910996445](http://antirez.com/news/56#comment-910996445))
about the behavior of the system in some 'real world' where complex network
partitions are rare. On the other hand, aphyr made a much more theoretically
sound argument (including using TLA+ to demonstrate the validity of his
argument) in his post
([http://aphyr.com/posts/287-asynchronous-replication-with-failover](http://aphyr.com/posts/287-asynchronous-replication-with-failover)).

Despite having a huge amount of respect for Antirez and Redis, I strongly
believe that the approach aphyr took is the one we are going to need as we
build larger and more complex systems on unreliable infrastructure. Our
engineering intuition, as excellent as it may be for single-node systems,
almost always fails us with distributed systems. To get around this, we need
to replace intuition. The tools that aphyr uses, such as TLA+ and carefully
crafted counterexamples and diagrams, are an extremely good start in that
direction. Getting a computer (in this case TLA+'s model checker TLC) to
exhaustively test a design specification is very powerful. Comparing those
results to the ones that we expected is even more powerful.
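
To make the flavor of that concrete, here is a minimal sketch in Python (not
TLA+) of the same idea: enumerate every reachable state of a toy
asynchronous-replication-with-failover model and check one safety invariant.
The model, its transitions, and the invariant are invented for illustration;
they are not aphyr's actual specification or Redis itself:

    # A toy exhaustive model checker (Python, not TLA+). The model is an
    # invented miniature of asynchronous primary/replica replication with
    # failover, NOT aphyr's actual spec.
    from collections import deque

    # State: (primary_log, replica_log, partitioned, failed_over, acked)
    INITIAL = ((), (), False, False, frozenset())
    VALUES = (1, 2)  # cap distinct writes so the state space stays finite

    def successors(state):
        plog, rlog, part, failed, acked = state
        # A client writes to the primary and receives an acknowledgement.
        for v in VALUES:
            if v not in acked:
                yield (plog + (v,), rlog, part, failed, acked | {v})
        # Asynchronous replication catches the replica up (needs network).
        if not part and rlog != plog:
            yield (plog, plog, part, failed, acked)
        # The network partitions.
        if not part:
            yield (plog, rlog, True, failed, acked)
        # Failover: the replica is promoted with whatever log it has; the
        # old primary's unreplicated writes are discarded when it rejoins.
        if part and not failed:
            yield (rlog, rlog, False, True, acked)

    def invariant(state):
        # Safety: every acknowledged write is still in the primary's log.
        plog, _, _, _, acked = state
        return acked <= set(plog)

    # Breadth-first search over every reachable state, as TLC does.
    seen, frontier = {INITIAL}, deque([INITIAL])
    while frontier:
        s = frontier.popleft()
        if not invariant(s):
            print("Invariant violated:", s)  # an acked write was lost
            break
        for n in successors(s):
            if n not in seen:
                seen.add(n)
                frontier.append(n)

Even this toy finds the lost-acknowledged-write trace (write, partition,
failover) within a handful of states; TLC does the same job against real
specifications with enormously larger state spaces.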

The comment made by Metaxis
([http://antirez.com/news/56#comment-905001533](http://antirez.com/news/56#comment-905001533))
on Antirez's second reply is very good. Especially:

> I think your attempt to differentiate formal correctness and real world
> operations is deeply flawed and amounts to asserting anecdote - that what
> you have observed to be common makes for good design assumptions and better
> trade off decisions.

> Allow me to counter: Real world operations will inevitably tend to approach
> formal correctness in terms of observed failure modes. In other words, over
> time, you are more and more likely to see edge cases and freak occurrences
> that are predicted in theory but happen rarely in practice.

This closely matches my own experience. Just because I don't believe a
network can behave in a particular way doesn't mean it won't. The real world
is full of complex network partitions and Byzantine failures, and our systems
need to be safe when they happen.

~~~
TylerE
That all may be true - but is it asking the wrong question?

It can be tempting to stand on an ivory tower and proclaim theory, but what is
the real world cost/benefit? Are you building a NASA Shuttle Crawler-
transporter to get groceries?

~~~
mjb
> It can be tempting to stand on an ivory tower and proclaim theory, but what
> is the real world cost/benefit?

That's an important question. The real-world cost of fully understanding the
behavior of a system like Redis, amortized over all the users of that system,
is likely to be very low. That's a huge benefit of using well-tested pieces
(whether they are Redis, Cassandra, or Oracle) as the foundation of systems.

The important question to ask is what you are losing. Losing availability in
these complex cases is acceptable for the vast majority of applications. You
can save a lot of complexity and cost there. Losing safety is much more of a
problem, because the effects can be very long lived. Once you have put
inconsistent data into your database, you are in a world of hurt for
potentially a very long time.

I think that it's actually cheaper in the long run, even for small systems
with small goals, to build on top of safe components. The cost of writing
code to deal with the shifting reality of inconsistent data is high, as is
the cost of not writing that code.

I don't think that proving the safety of protocols, testing those safety
properties once implemented, and understanding failure modes is "ivory tower"
at all. It's just good engineering practice.

------
packetbeats
Fantastic in-depth article. In my opinion, the tests he performs show not so
much drawbacks of the tested systems as the fact that defining them in terms
of the CAP theorem can be misleading. For example, a CP system should, in
case of a partition, wait until the partition is resolved, potentially
forever. This is not useful in practice, where almost always some timeout is
used. This is why, if I'm interpreting the results correctly, not even
Postgres running on a single node can claim to be CP.

~~~
aphyr
_a CP system should, in case of a partition, wait until the partition is
resolved, potentially forever_

Not always. In the presence of fewer than N/2 failures (equivalently, so long
as a connected majority exists), many CP systems can continue to operate fully
on some nodes. Other nodes may not be able to service requests, though.
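
For concreteness, here is the quorum arithmetic behind that claim, as a small
Python sketch (the helper names are mine, not from any particular system):

    # Hedged sketch of majority-quorum failure tolerance.
    def majority(n: int) -> int:
        return n // 2 + 1            # smallest group that is still a majority

    def failures_tolerated(n: int) -> int:
        return n - majority(n)       # i.e. ceil(n/2) - 1

    for n in (3, 5, 7):
        print(n, "nodes tolerate", failures_tolerated(n), "failures")
    # 3 nodes tolerate 1, 5 tolerate 2, 7 tolerate 3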

 _This is not useful in practice, where almost always some timeout is used._

Exactly--if you broaden your time horizons for a request so that messages
_always_ arrive, you'll be OK. As far as I can tell, this is NuoDB's approach.
On the other hand, it means the database blocks some or all requests,
potentially forever. This is essentially unavailability, with the added fun
that all those requests might suddenly complete well after the client thinks
they failed. When your clients need to hand a response back in a matter of
seconds, you _need_ to provide for timeouts--which brings you back to the
problem of dropped messages again. :)
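
A minimal sketch of that failure mode, with Python threads standing in for a
real client and server (the queue-based "server" and the 0.1s timeout are
invented for illustration, not any particular database's behavior):

    # Invented toy: a queue stands in for the network, a thread for the
    # server, and the client gives up after 0.1 seconds.
    import queue, threading, time

    results = queue.Queue()

    def slow_server(value):
        time.sleep(0.5)                        # message delayed by a partition
        results.put(("committed", value))      # ...but the write still lands

    threading.Thread(target=slow_server, args=(42,), daemon=True).start()

    try:
        print("client sees:", results.get(timeout=0.1))
    except queue.Empty:
        print("client sees: timeout -- but did the write happen?")

    time.sleep(0.6)                            # much later...
    print("server state:", results.get_nowait())  # ('committed', 42)

The timeout tells the client nothing about whether the write happened; until
it can read back, the operation is simply in an indeterminate state.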

 _Postgres running on a single node can claim to be CP._

The Postgres node proper is (assuming you're using the right transaction
isolation level) linearizable. The client and the server, considered together,
may not be CP--depends on how you interpret network failures. I don't think
there's any real way around this--to the best of my limited knowledge, it's a
consequence of the FLP impossibility result.

~~~
zzzcpan
So, essentially nothing can ever be CP, not even a single node, since browsers
and humans will never wait forever for something, correct?

~~~
aphyr
C is a _safety_ property ("nothing bad happens"), but what you're talking
about is (kind of) a _liveness_ property ("something good happens"). Real
systems need both, so we have to discuss what to do about making correct
progress in the face of failures.

Theoretical CP systems can't just block indefinitely, because the network can
delay messages for 1 second longer than you choose to wait. Real-world CP
systems can usually assume the network recovers--but it might be days before
it does, so they'll need some strategy to handle all the requests in the
meantime.

Typically, systems choose to time out or otherwise abort certain requests--
exactly which depends on the logical topology of the system. For instance, if
the primary node for a given shard of the DB is unreachable from a client, a
CP-over-shard system would fail those requests--but not others. A totally CP
system like ZK might fail requests in any minority components, but continue to
work just fine if a majority component is preserved.

Bottom line: you can build CP systems. The only constraint a CP system has to
obey is linearizability: every successful operation must appear to take effect
atomically at some point between its invocation and its completion, giving all
nodes a single agreed-upon order of operations.
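
In that spirit, here is a minimal brute-force linearizability test for a
single register, sketched in Python (a toy of my own, not any production
checker):

    # Brute-force check: does any ordering of the operations respect both
    # real time and the semantics of a single register (initially None)?
    from itertools import permutations

    # Each op: (invoke_time, return_time, kind, value)
    def linearizable(history):
        def respects_real_time(order):
            # If b returned before a was invoked, a may not precede b.
            for i, a in enumerate(order):
                for b in order[i + 1:]:
                    if b[1] < a[0]:
                        return False
            return True

        def respects_register(order):
            value = None
            for _, _, kind, v in order:
                if kind == "write":
                    value = v
                elif value != v:     # a read must see the latest write
                    return False
            return True

        return any(respects_real_time(o) and respects_register(o)
                   for o in permutations(history))

    # A write of 1 completes, then a later read returns 0: not linearizable.
    history = [(0, 1, "write", 1), (2, 3, "read", 0)]
    print(linearizable(history))   # False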

------
mdellabitta
I think (╯°□°）╯︵ ┻━┻ and ヽ(´ー｀)ノ need to be declared as constants in all
logging libraries and used accordingly.

~~~
BHSPitMonkey
Let's hope for some tight integration with Android's Log.wtf method:

[http://developer.android.com/reference/android/util/Log.html...](http://developer.android.com/reference/android/util/Log.html#wtf\(java.lang.String%2C%20java.lang.String\))

------
Roboprog
Interesting, but needs a summary to help conceptualize the details which it
slogs through.

~~~
aphyr
TL;DR: designing distributed systems is hard, so think carefully about your
problem space, required invariants, and risk tolerance; then test your systems
accordingly.

~~~
snaky
TL;DR: first rule of engineering reliable distributed systems: "do not
distribute"

