Hacker News new | past | comments | ask | show | jobs | submit login
You Can't Sacrifice Partition Tolerance (codahale.com)
136 points by codahale on Oct 7, 2010 | hide | past | favorite | 50 comments

I think the author, like many recently exposed to the CAP theorem, is confused about the meaning of partition tolerance, leading to ridiculous conclusions.

Partition tolerance does not mean your distributed system can't be consistent and available because your network dropped one packet, or one node failed. What would be the point of such a definition? Instead, the CAP theorem implies that while the network is partitioned, consistency or availability must be sacrificed. In the case of the dropped packet, once it is retransmitted the partition is healed and progress can be made. Or in the case of the failed node, nothing says that the rest of the system can't be consistent and available, so that the system as a whole maintains that property. There is no requirement that the unavailable node be available.

Truly partition tolerant systems are those that continue to function in the face of a prolonged partition, and those are the systems that must sacrifice either consistency or availability.

What you're saying is that so long as there are no partitions, the system can be Consistent and Available, but if there's a partition, it can't.

"Consistent sometimes" is not the same thing as "Consistent" and "Available sometimes" is not the same thing as "Available" - and so "Consistent and Available sometimes" is not the same as "Consistent and Available".

I believe you might be guilty of confusing "Eventual Consistency" with "Consistency".

Funnily enough, no-one has found much use for "Eventual Availability" so far.

Assuming eventual availability can be pretty handy -- one way deal with a dependency outage is to retry with an exponential backoff. If a dependency is unavailable now, and your system keeps retrying until it is, then you are assuming your dependency will become available again eventually.

Fair point - but of course your client is not making any progress, and so the unavailability ripples up. It's unlikely that there is user facing case where this is a useful way to work, though I can see it's use in loosely coupled connections between backends.

Not quite. What I'm saying is that a dropped packet or a failed node are not partitions as far as the CAP theorem is considered.

A distributed system is considered available if "every request received by a non-failing node [results] in a response." It does not mean you cannot retransmit or retry.

Similarly, the consistency guarantee only requires that there exist a total order on operations. Failures are ok, as we're allowed to retransmit, retry, and otherwise tolerate faults. There is no inconsistency, nor is anything eventual.

My point is that a "temporary" partition is just a fault, and as long as the fault is shorter than the allowed response time of the system, it doesn't make a difference.

No, dropped packets are partitions. They really are. A partitionable network is modelled as one which may fail to deliver any subset of sent messages between nodes. The Gilbert and Lynch paper makes this explicit.

The consistency guarantee requires that RW histories are compatible with some sequentially consistent history on a non-concurrent RW register. Defining a total order on operations is sufficient, I believe, but not necessary (does it matter what order two consecutive reads happened in?).

How do you explain Paxos, then? How does a dropped packet prevent the system from responding to queries? How about if I broadcast every response 10 times to everyone I know? How many packets must be dropped for the system to be considered unavailable?

Depends on the protocol, in general.

Paxos is, fundamentally, a quorum-based system that deals with reordering of messages. It sacrifices liveness for correctness - if the proposer does not hear back from a majority of nodes (in the case of, e.g. a partition), the protocol will not complete (availability is sacrificed).

My point is not that there is a 'vital packet' in every protocol, the omission of which will cause either a lack of availability or consistency (although I can certainly design protocols that way!) - it's that for every protocol there is a network partition which causes it to be either unavailable or inconsistent. That network partition might be dropping ten messages, or just one. Retransmitting would make sense, but in real life message failures are often highly temporally correlated :(

The proof of this, by the way, is in a very famous paper by Fischer, Lynch and Patterson called "The Impossibility of Distributed Consensus With One Faulty Process". One take away is that one slow-running process can take down any protocol. It may take a few missed messages, but only a single node...

CAPL: consistency, availability, partition tolerance, latency

Paxos sacrifices latency

Incorrect, paxos sacrifices availability. Paxos is consistent but does not necessarily ever terminate.

Actually, according to your own blog post on the subject, Gilbert and Lynch define a partition tolerance as:

    “The network will be allowed to lose arbitrarily many messages sent from one node to another”
There's a huge world of difference between the network losing arbitrarily many messages, and the definition you use elsewhere in this thread: namely, any subset of packets dropped, no matter how small, counts as a network partition.

No, arbitrarily many doesn't automatically mean a huge amount :) This definition covers permanent partitions, but also encompasses temporary partitions which are effectively one dropped message or more. There exist protocols which will be broken by the loss of a single message. Paxos may not be one of them, but there is also a pattern of loss which will break that as well.

The theory behind all this really does hold this point up. I have another blog post with much more detail on the theory here: http://the-paper-trail.org/blog/?p=49, but I warn you it may be heavy going.

Yes, a system is available if one node doesn't respond and you can contact another. But that node will be unable to guarantee consistency.

If a node you can contact is required to guarantee consistency, there will be some times that it will have to refuse your request because other nodes are not contactable.

The author's point was that in any distributed system there is a non-zero probability of a network failure. While both clients and server nodes can retry connections, there is a non-zero probability that the problem will persist longer than your "availability agreement" allows. In that case, you have a choice - return potentially inconstent data or refuse the request.

What you seem to be arguing is that the probabilities of failure - in particular of repeated failure - while non-zero, are effectively zero. The author would disagree (as they point out, the probabilities combine exponentially as the number of nodes increase.) I think he's right and that you are wrong.

Paxos is the quintessential example of a highly available, consistent system. It is available as long as more than half of the nodes are up and able to communicate with each other. It remains consistent, regardless of the failure pattern. You really do only have to worry about a true network partition. This isn't a probabilistic argument in any sense.

As you have yourself pointed out, Paxos will in some cases (when less than half the nodes are up) become unavailable, but remain consistent in all cases where it is available.

So it is tolerant of partition and it sacrifices availability in favour of consistency. So it is CP, not CA.

Of course it will become either unavailable or inconsistent (or both) during a network partition. That's the essence of the CAP theorem.

But what does it mean to tolerate a partition? As if the system has a choice?

Any CA system is claiming to be consistent and available as long as the network doesn't partition. That's the strongest statement you can make under the CAP theorem, and Paxos certainly falls in that camp.

My problem with the original article was that it claimed that any individual network or node failure was a partition affecting the consistency or availability of the system. Paxos is a clear counterexample to that, as it tolerates a lot more than that without sacrificing consistency or availability.

Once the network actually partitions (or half the nodes become unreachable), then you are correct. The CAP theorem comes into play again and we must sacrifice either C or A, and Paxos chooses A.

* it never becomes inconsistent (C) * it always returns either success or failure (A) * sufficiently severe partitions kill it dead (!P)

Or does any system subject to hardware or power failure fail to count as "available"?

I'm afraid you're not quite correct. CAP says that, in the face of potentially arbitrary network partitions (which are precisely modelled by a set of dropped messages) you can be 100% consistent, or 100% available, but you can't be both.

If the network is allowed to drop packets there are times where either you must not respond to requests (as doing so would violate sequential consistency) or you must respond incorrectly, potentially with stale information.

The network partition that forces you to drop one of these guarantees might be quite dramatic - but note that, for example, a quorum system is only available on the majority side of its partition. Therefore if a single node is unable to deliver messages (due to a network partition event), it will not be able to correctly respond to requests and must either not respond to its clients or respond incorrectly.

> I'm afraid you're not quite correct.

So then what am I missing? It sounds to me like the rest of your post is reiterating what you just called incorrect.

You said that wasn't what TFA was describing, but it is.

This is the most clear and well-written summary that I've seen of the tradeoffs presented by the CAP theorem. Hopefully it clears up a lot of the confusion out there.

I think this is the tweet that prompted the post: http://twitter.com/JamesMPhillips/status/26502076366

I think the yield/harvest concept described here is one of the more useful models I've heard about in a while for thinking about fault-tolerance tradeoffs. My thanks to @codahale for the write-up, and particularly for the references.

the best way i've heard it phrased is 'given the presence of a network partition, you must choose whether to maintain consistency or availability'. this does not mean that any network partition will make data unavailable if you choose C. it only means that some network partitions will make some data unavailable to some machines. picking A does not guarantee that all data will be available to all machines in the presence of a partition either. given a CP system with 3 way replication requiring a quorum to make progress, i would argue that the set of partitions in which data becomes unavailable yet would still be available had AP been chosen is very small and not worth worrying about. in systems designed to be up 100% of the time where partitions are the exception rather than the norm CP is almost always the right choice. in systems designed for network partitons, like replication to mobile devices or laptops or whatever, AP is almost always the right choice. the problem with trying to apply the CAP theorem to the real world is that the CAP theorem's definition of availability is not the same as most people's definition of availability in practice.

> in systems designed for network partitons, like replication to mobile devices or laptops or whatever, AP is almost always the right choice.

Although it is pretty much the only choice for general purpose file sync, it's still not a good choice. It is difficult to train non-technical staff to deal with inconsistencies on resync, and they don't want to have to deal with it. I've had success with Synctus precisely because it guarantees consistency (it is CP, and any one node keeps the A for a given file). Of course, this only works for mostly-on systems.

I take this to mean that if the nodes are disconnected from each other, Synctus disallows all access to certain files on node A; i.e. the files that node B currently owns.

Do I understand correctly?

Yes, that's right. Although currently it provides read-only access if a replica (not necessarily known to be the latest) is available, so I suppose that's not quite fully C in the read-only case. In the future, this might be a configurable option.

yes that's certainly true. if there isn't a sensible option for a merge policy and non technical users have to resolve it manually you're fighting a losing battle in an AP system.

I would disagree with the central thesis. You can sacrifice P for a weaker version: assume a network which "eventually heals": any live node will answer at least one message in a hundred, say [any node which does not is assumed to be unavailable]. The alternative to P is not "perfect network", it's "bounded from below on the badness thereof network", a significantly more realistic beast.

An update: Eric Brewer, who originally posited the CAP theorem, endorses this article: http://twitter.com/eric_brewer/status/26819094612

I don't agree. For instance Redis Cluster will be consistent (under the limit of physics) and not partition tolerant. But why this requires a network that will never have troubles? Simply when the network will be broken the cluster will not work at all.

What Redis Cluster will guarantee is that you can have M-1 nodes, with M being the number of replicas per "hash slot", that can go down, and/or get partitioned.

So this is a form of "weak" tolerance to partition, where at least a given percentage of the nodes must remain up and able to talk to each other.

But in the practice this is how most networks work. Single computers fail, and Redis Cluster will be still up. Single computers (or up to M-1) can experience networking problems, and Redis will continue to work.

In the unlikely condition that the network is split in two halves the cluster will start replying with an error to the clients.

This means that the sys admins have to design the network so that it is unlikely that there are strange split patterns, like A and B can talk to C that can tolk to D but blablabal... in high performance network with everything well-cabled and without complex routing this should not be a problem, IMHO.

In your first paragraph's example ("Simply when the network will be broken the cluster will not work at all.") you are sacrificing availability.

In the rest of your post, you seem to be sacrificing consistency; one server is down, and thus not receiving any updates from the other servers when data gets updated.

I'm not sure you understood the point of the article, so I'll try to restate it: When part of your system goes down (and it will), you can choose between refusing requests, in which case you sacrifice availability, or serving requests, in which case you sacrifice consistency, since the part of the system which is down cannot be updated when you update data, or cannot be queried in the case of data which is insufficiently replicated. You _cannot_ choose both, since that would require communicating with the downed server.

why do you think my servers are interconnected? I think your conclusions are broken because of many non-always-true assumptions.

In Redis Cluster there is no cluster data communication if not for resharding that only works when the whole cluster is on and is done by the sys administrator when adding a node.

So in normal conditions, a node will either:

1) Accept a query, or

2) Tell the client: no, ask instead

All the nodes are connected only to make sure the state of the cluster is up. If there are too much nodes down from the point of view of a single node it will reply to the client with a cluster error.

What I'm sacrificing is only consistency because in every given time there is only a single host that is getting the queries for a given subset of keys.

The exception is in the resharding case that is also fault-tolerant. Or slave election (fault tolerance is obtained via replicas).

As a side note, the clients should cache what node is responsible for a given set of keys, so after some time and when there are no failures/resharding in act, every client will directly ask the right node, making the solution completely horizontally scalable.

Dummy clients will just do always the ask-random-node + retry stage if they are unable to take state.

Edit: there are little fields like this that are totally in the hands of academia. My contribution is from the point of view of a dummy hacker that can't understand complex math but that will try to be much more pragmatic.

"If there are too much nodes down from the point of view of a single node it will reply to the client with a cluster error."

This sacrifices availability. Remember, the cluster doesn't include only the servers; it also includes the clients, since ultimately the point of a database server is to provide the data to the clients upon request.

sure, my tradeoffs are clear, I sacrifice availability in every part of the net where less than M-1 nodes appears to be down in order to win: 1) consistency. 2) latency.

What I did was to stress the tradeoffs that my data model was forcing itself, as Redis handles complex aggregate data and an eventual consistent solutions sucks in this context.

So Redis Cluster users will be a fast scalable consistent solution that will start trowing errors if the network will go down badly, but that will survive if a few nodes will go bad or if there are small network problems affecting a small number of nodes. If this sounds too little available please explain me this:

We have a network with 10 web servers and 10 DB nodes.

The netsplit will split 8 nodes from all the rest, so 10 web servers will be able to talk with 2 nodes.

I wonder how this two nodes will be able to handle all the traffic usually handled by 10 nodes. Netsplit tolerance is a myth if you don't specify very very very well under what conditions.

Re: "In the rest of your post, you seem to be sacrificing consistency; one server is down, and thus not receiving any updates from the other servers when data gets updated."

See: http://news.ycombinator.com/item?id=1768719

Partitions happen, often due to human error. Asymmetric communication scenarios are not exotic at all. It's very common to see them when some link is saturated in one direction. This requires no complex routing, and I've seen it from something as simple as a badly configured backup script.

Also, beware of confusing your local failure detector for a global one. The emergent behavior is a pain when you don't take this into account.

This blog post that I wrote a few months ago also explains the same issue, and may be of interest for those looking for a separate explanation:


For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.

For a CA system, any node which is unable to assure global consistency reports itself as failed. It will neither return bogus results, nor hang indefinitely.

In asynchronous networks it is surprisingly hard to detect failures, even of yourself.

Reporting an error condition counts as an availability violation.

In asynchronous networks it is surprisingly hard to detect failures, even of yourself.

The reason failures are hard to detect in asynchronous networks is that permissible message transit times are unbounded; ie they refuse to acknowledge the presence of any partition. If you acknowledge the possibility of partitions, then your system is by definition not asynchronous.

Reporting an error condition counts as an availability violation.

This is bullshit. Per the definition quoted in the linked article, availability only means that "...every request must terminate.". It is not required that it terminate successfully.

"The reason failures are hard to detect in asynchronous networks is that permissible message transit times are unbounded; ie they refuse to acknowledge the presence of any partition. If you acknowledge the possibility of partitions, then your system is by definition not asynchronous."

No. Like you say, async means failures are hard to distinguish from delays. If a node's NIC sets on fire, I'm pretty sure no messages are ever going to get delivered to it - hence it is partitioned from the network. It is very hard to tell whether it has failed, or whether it is just running slowly, in an async network.

"This is bullshit. Per the definition quoted in the linked article, availability only means that "...every request must terminate.". It is not required that it terminate successfully."

No. The definition of the atomic object modelled by the service doesn't include an 'error' condition. Otherwise I could make a 100% available, 100% consistent system by always returning the error state, which is thoroughly uninteresting. You have to read more than the quoted definition in the Gilbert and Lynch paper to start calling bs - it is very clear that authors do not allow an 'error' response.

The definition of the atomic object modelled by the service doesn't include an 'error' condition.

A curious inconsistency.

But still... ok, the bullshit is elsewhere.

For a distributed (i.e., multi-node) system to not require partition-tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. You and I do not work with these types of systems because they don’t exist.

This is bullshit; it assumes that the only options are complete reliance on absence of failures, and tolerance of arbitrary partitions. Specifically, it claims that P(system fails) = 1 - (1 - P(any particular node fails)) ^ (number of nodes), and that therefore "the question you should be asking yourself is: In the event of failures, which will this system sacrifice? Consistency or availability?". There are plenty of real-life counter-examples to this; given a real-world (ie, at least partially synchronous) network, it is possible to maintain consistency and availability in the face of partition/failure of up to half-minus-one of your nodes. This blows the probability calculations completely out of the water.

You cannot, however, choose both consistency and availability in a distributed system. ... As a thought experiment, imagine a distributed system which keeps track of a single piece of data using three nodes—A, B, and C—and which claims to be both consistent and available in the face of network partitions.

Hey, see how that "in the face of network partitions" snuck in there? It's bullshit, you want "in the face of these specific kinds of network partitions", things like crashed nodes. Just enough to invalidate that abuse of statistics used to claim that Availability is impossible.

"A curious inconsistency."

I don't agree - availability is a totally meaningless property if you are allowed to occasionally return "no, I won't process your request". Such a response communicates nothing about the state of the atomic object you are writing to or reading from, so you can always return it and trivially satisfy 'availability' if we define it this way.

To your other point - be aware that I didn't write the article, so I'm not speaking for the author. However, I think you're right that the article makes it sound a bit like a single failure or message loss will cause any protocol to immediately sacrifice availability or consistency. This isn't the case - all CAP does is establish that for every protocol, there exists a failure pattern that will force it to abandon one of the two.

For quorum systems, this means that a permanent partition causes one half of the partition to no longer be the majority, and therefore can no longer be consistent if it responds to any requests. Paxos is another example.

So you're right, there are particular patterns of partition that stop a protocol from functioning correctly. And many that don't - hence the term 'fault tolerant' has some meaning.

Avoiding these patterns, in practice, can turn out to be surprisingly tricky. High-performance systems can't afford to have too many participants, which means that the probability of a problematic failure is higher than we might like (five participants in a consensus protocol is already a lot for high throughput, but now we are susceptible to only three failures). Failures are also often correlated, so independence assumptions don't hold as much as would like. Machines crash. Networking gear fails.

There's no abuse of statistics here. The probability of a particular failure pattern can be engineered low, and at that point you must weigh the trade-offs of the cost of loss of availability / consistency vs. the effort you make to minimise the chance of occurrence. We are talking about edge cases here, and the implicit assumption is that the cost of hitting one of them is huge (and it often is). However if you run a cluster large enough, you hit edge cases all the time.

(Although you mention partial synchrony, note that most of these results are mainly applicable to asynchronous networks in the first instance).

I don't agree - availability is a totally meaningless property if you are allowed to occasionally return "no, I won't process your request". Such a response communicates nothing about the state of the atomic object you are writing to or reading from, so you can always return it and trivially satisfy 'availability' if we define it this way.

Availability of the individual node, or of the service as a whole? So long as the nodes always answer quickly, can't I just ask a few different ones until I get a successful response (or conclude that the entire system is down)?

The system may be unable to give you a consistent response, no matter who you ask. It really depends on how you build your protocol.

Let's imagine a system where you want to be 100% available for reads, for any number of failures less than N. Then you need to be able to submit every single write to every single node in the system, otherwise the failure of all but the up-to-date node will result in stale reads.

But then if a single node is partitioned from the network, we can't (correctly) be available for writes, because the system is incapable of sending updates to all reads as required. It doesn't matter which node you ask.

The point is that every system has a failure mode like this. I take your point that it's not always just a single node failure that precipitates the abandonment of C or A, but that was never the point of the CAP theorem.

A CA system is simply one which is not available at all during a network partition, since it is partition-intolerant.

This lack of availability is different from the availability in the A of CAP, since that availability holds only so long as the network is not partitioned (by definition in a CA system).

Such a system might not be considered a distributed system at all (although it may still be distributing load), since a partition-intolerant system is effectively one system as far as the CAP theorem is concerned.

So it's essentially a special case of the CAP theorem, but it is still useful to describe it as CA.

No, it's exactly the same. Availability is a guarantee that all requests are eventually responded to within some time bound, whatever that is. During the partition, availability is violated.

Therefore it's not a CA system, but a C system.

Are you sure? Availability in the CAP theorem is a state, as are (P)artition and (C)onsistency. Your system can't be simultaneously consistent and available in the presence of a network partition. The A in CAP doesn't mean always available. It just means the system can, at best, be any two of the three at a time.

No, it does mean always available - honestly :)

If there is some time period during which requests are not responded to within a time bound, the system is not available then, and further is not a 'highly' or 100% available system. That is what the CAP theorem is talking about.

Consistency, similarly, is not a state but a property that holds across all responses. Either you return a consistent response to all your requests, or you don't. In the context of CAP, there is no middle ground.

I guess I'm missing something because the concept of quorum deals with partition tolerance. You require, to provide an answer, that X nodes agree on the state of the data.

3 nodes, 2 must agree. When the partition heals, the resolution process happens. It would have to be a SERIOUSLY bad network design and quorum setting that allows a quorum on both sides of the split.

It's just like eventual consistency. We're not talking days or even minutes. We're talking milliseconds/seconds of partition split. If you have a partition split for days, you have OTHER issues to address.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact