Hacker News new | past | comments | ask | show | jobs | submit login
Jepsen: Testing the Partition Tolerance of PostgreSQL, Redis, MongoDB and Riak (infoq.com)
76 points by sethev on June 20, 2013 | hide | past | favorite | 16 comments

The exchange between Antirez and aphyr following the post about Redis sentinel is a fascinating comparison between two engineering approaches. Antirez makes a qualitative argument (http://antirez.com/news/56, especially http://antirez.com/news/56#comment-910996445) about the behavior of the system in some 'real world' where complex network partitions are rare. On the other hand, aphyr made a much more theoretically sound argument (including using TLA+ to demonstrate the validity of his argumement) in his post (http://aphyr.com/posts/287-asynchronous-replication-with-fai...).

Despite having a huge amount of respect for Antirez and Redis, I strongly believe that the approach aphyr took is the one we are going to need as we build larger and more complex systems on unreliable infrastructure. Our engineering intuition, as excellent as it may be for single-node systems, almost always fails us with distributed systems. To get around this, we need to replace intuition. The tools that aphyr uses, such as TLA+ and carefully crafted counterexamples and diagrams, are an extremely good start in that direction. Getting a computer (in this case TLA+'s model checker TLC) to exhaustively test a design specification is very powerful. Comparing those results to the ones that we expected is even more powerful.

The comment made by Metaxis (http://antirez.com/news/56#comment-905001533) on Antirez's second reply is very good. Especially:

> I think your attempt to differentiate formal correctness and real world operations is deeply flawed and amounts to asserting anecdote - that what you have observed to be common makes for good design assumptions and better trade off decisions.

> Allow me to counter: Real world operations will inevitably tend to approach formal correctness in terms of observed failure modes. In other words, over time, you are more and more likely to see edge cases and freak occurrences that are predicted in theory but happen rarely in practice.

This closely matches my own experience. Just because I don't believe a network can behave in a particular way, doesn't mean it won't. The real world is full of complex network partitions and Byzantine failures, and our systems need to be safe when they happen

That all may be true - but is it asking the wrong question?

It can be tempting to stand on an ivory tower and proclaim theory, but what is the real world cost/benefit? Are you building a NASA Shuttle Crawler-transporter to get groceries?

> It can be tempting to stand on an ivory tower and proclaim theory, but what is the real world cost/benefit?

That's an important question. The real world costs of fulling understanding the behavior of a system like Redis, amortized over all the users of that system is likely to be very low. That's a huge benefit of using well-tested pieces (whether they are Redis, Cassandra or Oracle) as the foundation of systems.

The important question to ask is what you are losing. Losing availability in these complex cases is acceptable for the vast majority of applications. You can save a lot of complexity and cost there. Losing safety is much more of a problem, because the effects can be very long lived. Once you have put inconsistent data into your database, you are in a world of hurt for potentially a very long time.

I think that it's actually cheaper in the long run, even for small systems with small goals, to build on top of safe components. The costs of writing code to deal with the shifting reality of inconsistent data is higher, as are the costs of not writing that code.

I don't think that proving the safety of protocols, testing those safety properties once implemented, and understanding failure modes is "ivory tower" at all. It's just good engineering practice.

I tried to address that question in the full version of the article, but had to condense it somewhat for the InfoQ post. http://aphyr.com/posts/286-call-me-maybe-final-thoughts

Fantastic in depth article. In my opinion, the tests he perform show not so much drawbacks of the tested systems, but more the fact that defining them in terms of the CAP theorem can be misleading. For example, a CP system should, in case of a partition, wait until the partition is resolved, potentially forever. This is not useful in practice, where almost always some timeout is used. This is why, if I'm interpreting the results correctly, not even a Postgres running on a single node can claim to be CP.

a CP system should, in case of a partition, wait until the partition is resolved, potentially forever

Not always. In the presence of up to N/2-1 failures (or alternatively, so long as a connected majority exists), many CP systems can continue to operate fully on some nodes. Other nodes may not be able to service requests, though.

This is not useful in practice, where almost always some timeout is used.

Exactly--if you broaden your time horizons for a request so that messages always arrive, you'll be OK. As far as I can tell, this is NuoDB's approach. On the other hand, it means the database blocks some or all requests, potentially forever. This is essentially unavailability, with the added fun that all those requests might suddenly complete well after the client thinks they failed. When your clients need to hand a response back in a matter of seconds, you need to provide for timeouts--which brings you back to the problem of dropped messages again. :)

Postgres running on a single node can claim to be CP.

The Postgres node proper is (assuming you're using the right transaction isolation level) linearizable. The client and the server, considered together, may not be CP--depends on how you interpret network failures. I don't think there's any real way around this--to the best of my limited knowledge, it's a consequence of the FLP impossibility result.

So, essentially nothing can ever be CP, not even a single node, since browsers and humans will never wait forever for something, correct?

C is a safety property ("nothing bad happens"), but what you're talking about is (kind of) a liveness property ("something good happens"). Real systems need both, so we have to discuss what to do about making correct progress in the face of failures.

Theoretical CP systems can't just block indefinitely, because the network can delay messages for 1 second longer than you choose to wait. Real-world CP systems can usually assume the network recovers--but it might be days before it does, so they'll need some strategy to handle all the requests in the mean time.

Typically, systems choose to time out or otherwise abort certain requests--exactly which depends on the logical topology of the system. For instance, if the primary node for a given shard of the DB is unreachable from a client, a CP-over-shard system would fail those requests--but not others. A totally CP system like ZK might fail requests in any minority components, but continue to work just fine if a majority component is preserved.

Bottom line: you can build CP systems. The only constraint a CP system has to obey is linearizability: successful operations must appear in the same order on all nodes.

I think (╯°□°)╯︵ ┻━┻ and ヽ(´ー`)ノ need to be declared as constants in all logging libraries and used accordingly.

Let's hope for some tight integration with Android's Log.wtf method:


Interesting, but needs a summary to help conceptualize the details which it slogs through.

TL;DR: designing distributed systems is hard, so think carefully about your problem space, required invariants, and risk tolerance; then test your systems accordingly.

TL;DR: first rule of engineering reliable distributed systems: "do not distribute"

The original (non-Infoqd) article's conclusion page http://aphyr.com/posts/286-call-me-maybe-final-thoughts was quite nice.

Antirez responded to the original article citing that Sentinel, which was used for this test, was never designed to accomplish what was tested. So the Redis portion of that article is rather misleading: http://antirez.com/news/55

I am confident from theoretical argument [1] and experiment [2] that any system for failover between asynchronously replicated Redis nodes (in the absence of a strong external coordinator for requests) is subject to this kind of data loss and inconsistency.

If you read the post carefully, and Antirez's followup [3], I think you'll find Salvatore's claim is not that Redis is robust to partition, but that he doesn't think the risk is significant enough to address. I disagree: partitions are a real challenge for distributed systems at all scales. [4]

[1] http://aphyr.com/posts/287-asynchronous-replication-with-fai...

[2] http://aphyr.com/posts/283-call-me-maybe-redis

[3] http://antirez.com/news/57

[4] http://aphyr.com/posts/288-the-network-is-reliable

Applications are open for YC Winter 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact