

You Can't Sacrifice Partition Tolerance - codahale
http://codahale.com/you-cant-sacrifice-partition-tolerance/

======
dadkins
I think the author, like many recently exposed to the CAP theorem, is confused
about the meaning of partition tolerance, leading to ridiculous conclusions.

Partition tolerance does not mean your distributed system can't be consistent
and available because your network dropped one packet, or one node failed.
What would be the point of such a definition? Instead, the CAP theorem implies
that _while_ the network is partitioned, consistency or availability must be
sacrificed. In the case of the dropped packet, once it is retransmitted the
partition is healed and progress can be made. Or in the case of the failed
node, nothing says that the _rest_ of the system can't be consistent and
available, so that the system as a whole maintains that property. There is no
requirement that the unavailable node be available.

Truly partition tolerant systems are those that continue to function in the
face of a prolonged partition, and those are the systems that must sacrifice
either consistency or availability.

~~~
allertonm
What you're saying is that so long as there are no partitions, the system can
be Consistent and Available, but if there's a partition, it can't.

"Consistent sometimes" is not the same thing as "Consistent" and "Available
sometimes" is not the same thing as "Available" - and so "Consistent and
Available sometimes" is not the same as "Consistent and Available".

I believe you might be guilty of confusing "Eventual Consistency" with
"Consistency".

Funnily enough, no-one has found much use for "Eventual Availability" so far.

~~~
dadkins
Not quite. What I'm saying is that a dropped packet or a failed node are _not_
partitions as far as the CAP theorem is concerned.

A distributed system is considered available if "every request received by a
non-failing node [results] in a response." It does not mean you cannot
retransmit or retry.

Similarly, the consistency guarantee only requires that there exist a total
order on operations. Failures are ok, as we're allowed to retransmit, retry,
and otherwise tolerate faults. There is no inconsistency, nor is anything
eventual.

My point is that a "temporary" partition is just a fault, and as long as the
fault is shorter than the allowed response time of the system, it doesn't make
a difference.
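That retry argument can be sketched concretely. This is a toy illustration, not any real system's code; `flaky_send` is a made-up stand-in for a request that hits a transient fault:

```python
import time

def call_with_retries(send, deadline_s, retry_interval_s=0.01):
    """Keep retrying until the response deadline: a transient fault that
    heals before the deadline is masked and never observed by the caller."""
    start = time.monotonic()
    while True:
        try:
            return send()
        except ConnectionError:
            if time.monotonic() - start >= deadline_s:
                raise TimeoutError("fault outlasted the allowed response time")
            time.sleep(retry_interval_s)

# A fault that heals after two dropped attempts: the caller still gets
# an ordinary answer, so neither C nor A was visibly sacrificed.
attempts = {"n": 0}

def flaky_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("dropped packet")
    return "ok"
```

If the fault instead outlasts `deadline_s`, the caller finally sees a timeout, which is exactly the point at which the C-vs-A choice surfaces.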

~~~
HenryR
No, dropped packets _are_ partitions. They really are. A partitionable network
is modelled as one which may fail to deliver any subset of sent messages
between nodes. The Gilbert and Lynch paper makes this explicit.

The consistency guarantee requires that RW histories are compatible with some
sequentially consistent history on a non-concurrent RW register. Defining a
total order on operations is sufficient, I believe, but not necessary (does it
matter what order two consecutive reads happened in?).

~~~
dadkins
How do you explain Paxos, then? How does a dropped packet prevent the system
from responding to queries? How about if I broadcast every response 10 times
to everyone I know? How many packets must be dropped for the system to be
considered unavailable?

~~~
HenryR
Depends on the protocol, in general.

Paxos is, fundamentally, a quorum-based system that deals with reordering of
messages. It sacrifices liveness for correctness - if the proposer does not
hear back from a majority of nodes (in the case of, e.g. a partition), the
protocol will not complete (availability is sacrificed).
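A minimal sketch of that quorum rule (toy code, not a real Paxos implementation):

```python
def proposal_accepted(acks, cluster_size):
    """A Paxos-style proposer makes progress only after hearing from a
    strict majority of acceptors. Cut off from a majority (e.g. by a
    partition), the round simply never completes: liveness/availability
    is sacrificed, consistency never is."""
    return len(acks) > cluster_size // 2

# With 5 acceptors, 3 acks form a quorum; 2 do not, and the round stalls.
```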

My point is not that there is a 'vital packet' in every protocol, the omission
of which will cause either a lack of availability or consistency (although I
can certainly design protocols that way!) - it's that for every protocol there
is a network partition which causes it to be either unavailable or
inconsistent. That network partition might be dropping ten messages, or just
one. Retransmitting would make sense, but in real life message failures are
often highly temporally correlated :(

The proof of this, by the way, is in a very famous paper by Fischer, Lynch and
Paterson called "Impossibility of Distributed Consensus with One Faulty
Process". One takeaway is that one slow-running process can take down any
protocol. It may take a few missed messages, but only a single node...

~~~
ithkuil
CAPL: consistency, availability, partition tolerance, latency

Paxos sacrifices latency

~~~
kaib
Incorrect: Paxos sacrifices availability. Paxos is consistent but does not
necessarily ever terminate.

------
ericflo
This is the most clear and well-written summary that I've seen of the
tradeoffs presented by the CAP theorem. Hopefully it clears up a lot of the
confusion out there.

I think this is the tweet that prompted the post:
<http://twitter.com/JamesMPhillips/status/26502076366>

------
rcoder
I think the yield/harvest concept described here is one of the more useful
models I've heard about in a while for thinking about fault-tolerance
tradeoffs. My thanks to @codahale for the write-up, and particularly for the
references.

------
jeffffff
the best way i've heard it phrased is 'given the presence of a network
partition, you must choose whether to maintain consistency or availability'.
this does not mean that any network partition will make data unavailable if
you choose C. it only means that some network partitions will make some data
unavailable to some machines. picking A does not guarantee that all data will
be available to all machines in the presence of a partition either.

given a CP system with 3-way replication requiring a quorum to make progress,
i would argue that the set of partitions in which data becomes unavailable,
yet would still have been available had AP been chosen, is very small and not
worth worrying about.

in systems designed to be up 100% of the time, where partitions are the
exception rather than the norm, CP is almost always the right choice. in
systems designed for network partitions, like replication to mobile devices or
laptops or whatever, AP is almost always the right choice. the problem with
trying to apply the CAP theorem to the real world is that the CAP theorem's
definition of availability is not the same as most people's definition of
availability in practice.
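the quorum arithmetic behind the 3-way replication example can be sketched like this (a toy model with made-up replica names, not any particular system):

```python
from itertools import combinations

REPLICAS = frozenset({"A", "B", "C"})

def has_quorum(side):
    # under CP-with-quorum, a side of a split may serve the data only
    # if it holds a strict majority of the replicas
    return len(side) > len(REPLICAS) // 2

# every clean two-way split of three replicas leaves exactly one side
# with a quorum, so the data stays reachable somewhere; only the much
# rarer split into three singletons loses it entirely
two_way_splits = [
    (frozenset(side), REPLICAS - frozenset(side))
    for side in combinations(sorted(REPLICAS), 1)
]
```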

~~~
rlpb
> in systems designed for network partitions, like replication to mobile
> devices or laptops or whatever, AP is almost always the right choice.

Although it is pretty much the only choice for general purpose file sync, it's
still not a good choice. It is difficult to train non-technical staff to deal
with inconsistencies on resync, and they don't want to have to deal with it.
I've had success with Synctus precisely because it guarantees consistency (it
is CP, and any one node keeps the A for a given file). Of course, this only
works for mostly-on systems.

~~~
Deestan
I take this to mean that if the nodes are disconnected from each other,
Synctus disallows _all_ access to certain files on node A; i.e. the files that
node B currently owns.

Do I understand correctly?

~~~
rlpb
Yes, that's right. Although currently it provides read-only access if a
replica (not necessarily known to be the latest) is available, so I suppose
that's not quite fully C in the read-only case. In the future, this might be a
configurable option.

------
moshezadka
I would disagree with the central thesis. You can sacrifice P for a weaker
version: assume a network which "eventually heals" - say, any live node will
answer at least one message in a hundred [any node which does not is assumed
to be unavailable]. The alternative to P is not a "perfect network", it's a
"network whose badness is bounded", a significantly more realistic beast.

------
ericflo
An update: Eric Brewer, who originally posited the CAP theorem, endorses this
article: <http://twitter.com/eric_brewer/status/26819094612>

------
antirez
I don't agree. For instance, Redis Cluster will be consistent (within the
limits of physics) and _not_ partition tolerant. But why does this require a
network that will never have trouble? Simply, when the network breaks, the
cluster will not work at all.

What Redis Cluster will guarantee is that up to M-1 nodes, with M being the
number of replicas per "hash slot", can go down and/or get partitioned.

So this is a form of "weak" tolerance to partitions, where at least a given
percentage of the nodes must remain up and able to talk to each other.

But in practice this is how most networks work. Single computers fail, and
Redis Cluster will still be up. Single computers (or up to M-1) can experience
networking problems, and Redis will continue to work.

In the unlikely case that the network splits into two halves, the cluster
will start replying with an error to the clients.

This means that the sysadmins have to design the network so that strange split
patterns are unlikely, like A and B can talk to C, which can talk to D, and so
on. In a high-performance network with everything well-cabled and without
complex routing, this should not be a problem, IMHO.

~~~
inklesspen
In your first paragraph's example ("Simply when the network will be broken the
cluster will not work at all.") you are sacrificing availability.

In the rest of your post, you seem to be sacrificing consistency; one server
is down, and thus not receiving any updates from the other servers when data
gets updated.

I'm not sure you understood the point of the article, so I'll try to restate
it: When part of your system goes down (and it will), you can choose between
refusing requests, in which case you sacrifice availability, or serving
requests, in which case you sacrifice consistency, since the part of the
system which is down cannot be updated when you update data, or cannot be
queried in the case of data which is insufficiently replicated. You _cannot_
choose both, since that would require communicating with the downed server.

~~~
antirez
Why do you think my servers are interconnected? I think your conclusions are
broken because of several assumptions that are not always true.

In Redis Cluster there is no cluster _data_ communication except for
resharding, which only works when the whole cluster is up and is done by the
sys administrator when adding a node.

So in normal conditions, a node will either:

1) Accept a query, or

2) Tell the client: no, ask instead 1.2.3.4:6380

All the nodes are connected only to make sure the state of the cluster is up.
If too many nodes are down from the point of view of a single node, it will
reply to the client with a cluster error.

What I'm sacrificing is only consistency, because at any given time there is
only a single host getting the queries for a given subset of keys.

The exception is the resharding case, which is also fault-tolerant. Likewise
slave election (fault tolerance is obtained via replicas).

As a side note, clients should cache which node is responsible for a given set
of keys, so after some time, and when there are no failures or reshardings in
progress, every client will directly ask the right node, making the solution
completely horizontally scalable.

Dummy clients will just always do the ask-random-node + retry step if they are
unable to keep state.
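A rough sketch of the serve-or-redirect behaviour described above. The names and the hash function here are made up for illustration; the real Redis Cluster maps keys with CRC16 onto 16384 slots, approximated below with a simple byte sum:

```python
SLOTS = 16384  # Redis Cluster's hash slot count

def slot_for(key):
    # stand-in for the real CRC16(key) % 16384 mapping
    return sum(key.encode()) % SLOTS

class Node:
    def __init__(self, addr, owned_slots, slot_owners):
        self.addr = addr
        self.owned = owned_slots
        self.slot_owners = slot_owners  # shared view: slot -> owner address

    def query(self, key):
        slot = slot_for(key)
        if slot in self.owned:
            return ("OK", f"value-for-{key}")   # 1) accept the query
        return ("ASK", self.slot_owners[slot])  # 2) redirect the client
```

Since each slot has exactly one owner at a time, there is never a write conflict to reconcile; the price is an error (lost availability) whenever an owner is unreachable.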

Edit: there are few fields like this that are so completely in the hands of
academia. My contribution is from the point of view of a dummy hacker who
can't understand complex math but who tries to be much more pragmatic.

~~~
inklesspen
"If too many nodes are down from the point of view of a single node, it will
reply to the client with a cluster error."

This sacrifices availability. Remember, the cluster doesn't include only the
servers; it also includes the clients, since ultimately the point of a
database server is to provide the data to the clients upon request.

~~~
antirez
Sure, my tradeoffs are clear: I sacrifice availability in any part of the net
where more than M-1 nodes appear to be down, in order to win 1) consistency
and 2) latency.

What I did was stress the tradeoffs that my data model itself forces, as Redis
handles complex aggregate data and an eventually consistent solution sucks in
this context.

So Redis Cluster will give users a fast, scalable, consistent solution that
will start throwing errors if the network goes down badly, but that will
survive if a few nodes go bad or if there are small network problems affecting
a small number of nodes. If this sounds too little availability, please
explain this to me:

We have a network with 10 web servers and 10 DB nodes.

The netsplit will split 8 nodes from all the rest, so 10 web servers will be
able to talk with 2 nodes.

I wonder how these two nodes will be able to handle all the traffic usually
handled by 10 nodes. Netsplit tolerance is a myth if you don't specify very,
very well under which conditions it holds.

------
HenryR
This blog post that I wrote a few months ago also explains the same issue, and
may be of interest for those looking for a separate explanation:

http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/

------
tbrownaw
_For a distributed system to be continuously available, every request received
by a non-failing node in the system must result in a response._

For a CA system, any node which is unable to assure global consistency reports
itself as failed. It will neither return bogus results, nor hang indefinitely.

~~~
HenryR
In asynchronous networks it is surprisingly hard to detect failures, even of
yourself.

Reporting an error condition counts as an availability violation.

~~~
tbrownaw
_In asynchronous networks it is surprisingly hard to detect failures, even of
yourself._

The reason failures are hard to detect in asynchronous networks is that
permissible message transit times are unbounded; ie they refuse to acknowledge
the presence of any partition. If you acknowledge the possibility of
partitions, then your system is by definition not asynchronous.

 _Reporting an error condition counts as an availability violation._

This is bullshit. Per the definition quoted in the linked article,
availability only means that " _...every request must terminate._ ". It is not
required that it terminate _successfully_.

~~~
HenryR
"The reason failures are hard to detect in asynchronous networks is that
permissible message transit times are unbounded; ie they refuse to acknowledge
the presence of any partition. If you acknowledge the possibility of
partitions, then your system is by definition not asynchronous."

No. Like you say, async means failures are hard to distinguish from delays. If
a node's NIC catches fire, I'm pretty sure no messages are ever going to get
delivered to it - hence it is partitioned from the network. But in an async
network it is very hard to tell whether it has failed, or whether it is just
running slowly.

"This is bullshit. Per the definition quoted in the linked article,
availability only means that "...every request must terminate.". It is not
required that it terminate successfully."

No. The definition of the atomic object modelled by the service doesn't
include an 'error' condition. Otherwise I could make a 100% available, 100%
consistent system by always returning the error state, which is thoroughly
uninteresting. You have to read more than the quoted definition in the Gilbert
and Lynch paper before calling bs - it is very clear that the authors do not
allow an 'error' response.

~~~
tbrownaw
_The definition of the atomic object modelled by the service doesn't include
an 'error' condition._

A curious inconsistency.

But still... ok, the bullshit is elsewhere.

 _For a distributed (i.e., multi-node) system to not require partition-
tolerance it would have to run on a network which is guaranteed to never drop
messages (or even deliver them late) and whose nodes are guaranteed to never
die. You and I do not work with these types of systems because they don’t
exist._

This is bullshit; it assumes that the only options are _complete_ reliance on
absence of failures, and tolerance of arbitrary partitions. Specifically, it
claims that P(system fails) = 1 - (1 - P(any particular node fails)) ^ (number
of nodes), and that therefore "the question you should be asking yourself is:
In the event of failures, which will this system sacrifice? Consistency or
availability?". There are plenty of real-life counter-examples to this; given
a real-world (ie, at least partially synchronous) network, it is possible to
maintain consistency and availability in the face of partition/failure of up
to half-minus-one of your nodes. This blows the probability calculations
completely out of the water.

 _You cannot, however, choose both consistency and availability in a
distributed system. ... As a thought experiment, imagine a distributed system
which keeps track of a single piece of data using three nodes—A, B, and C—and
which claims to be both consistent and available in the face of network
partitions._

Hey, see how that "in the face of network partitions" snuck in there? It's
bullshit, you want "in the face of _these specific kinds of_ network
partitions", things like crashed nodes. Just enough to invalidate that abuse
of statistics used to claim that Availability is impossible.

~~~
HenryR
"A curious inconsistency."

I don't agree - availability is a totally meaningless property if you are
allowed to occasionally return "no, I won't process your request". Such a
response communicates nothing about the state of the atomic object you are
writing to or reading from, so you can always return it and trivially satisfy
'availability' if we define it this way.

To your other point - be aware that I didn't write the article, so I'm not
speaking for the author. However, I think you're right that the article makes
it sound a bit like a single failure or message loss will cause any protocol
to immediately sacrifice availability or consistency. This isn't the case -
all CAP does is establish that for every protocol, there exists a failure
pattern that will force it to abandon one of the two.

For quorum systems, this means that a permanent partition leaves one side
without a majority, and that side can no longer respond to any requests
consistently. Paxos is another example.

So you're right, there are particular _patterns_ of partition that stop a
protocol from functioning correctly. And many that don't - hence the term
'fault tolerant' has some meaning.

Avoiding these patterns, in practice, can turn out to be surprisingly tricky.
High-performance systems can't afford to have too many participants, which
means that the probability of a problematic failure is higher than we might
like (five participants in a consensus protocol is already a lot for high
throughput, but now we are susceptible to as few as three failures). Failures
are also often correlated, so independence assumptions don't hold as well as
we would like. Machines crash. Networking gear fails.

There's no abuse of statistics here. The probability of a particular failure
pattern can be engineered low, and at that point you must weigh the trade-offs
of the cost of loss of availability / consistency vs. the effort you make to
minimise the chance of occurrence. We are talking about edge cases here, and
the implicit assumption is that the cost of hitting one of them is huge (and
it often is). However if you run a cluster large enough, you hit edge cases
all the time.

(Although you mention partial synchrony, note that most of these results are
mainly applicable to asynchronous networks in the first instance).

~~~
tbrownaw
_I don't agree - availability is a totally meaningless property if you are
allowed to occasionally return "no, I won't process your request". Such a
response communicates nothing about the state of the atomic object you are
writing to or reading from, so you can always return it and trivially satisfy
'availability' if we define it this way._

Availability of the individual node, or of the service as a whole? So long as
the nodes always answer quickly, can't I just ask a few different ones until I
get a successful response (or conclude that the entire system is down)?

~~~
HenryR
The system may be unable to give you a consistent response, no matter who you
ask. It really depends on how you build your protocol.

Let's imagine a system where you want to be 100% available for reads, for any
number of failures less than N. Then you need to be able to submit every
single write to every single node in the system, otherwise the failure of all
but the up-to-date node will result in stale reads.

But then if a single node is partitioned from the network, we can't
(correctly) be available for writes, because the system is incapable of
sending updates to all nodes as required. It doesn't matter which node you
ask.
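That thought experiment can be written down directly (a toy register, not a real protocol):

```python
class ReadAnywhereRegister:
    """For reads to stay consistent from any single surviving node,
    every write must reach every node - so one unreachable node is
    enough to force the system to refuse writes."""

    def __init__(self, nodes):
        self.values = {n: None for n in nodes}

    def write(self, value, reachable):
        if set(reachable) != set(self.values):
            # a missed replica would later serve stale reads,
            # so the write is refused: write availability is lost
            raise RuntimeError("cannot update every replica")
        for n in self.values:
            self.values[n] = value

    def read(self, node):
        return self.values[node]
```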

The point is that every system has a failure mode like this. I take your point
that it's not always just a single node failure that precipitates the
abandonment of C or A, but that was never the point of the CAP theorem.

------
rlpb
A CA system is simply one which is not available at all during a network
partition, since it is partition-intolerant.

This lack of availability is _different_ from the availability in the A of
CAP, since that availability holds only so long as the network is not
partitioned (by definition in a CA system).

Such a system might not be considered a distributed system at all (although it
may still be distributing load), since a partition-intolerant system is
effectively one system as far as the CAP theorem is concerned.

So it's essentially a special case of the CAP theorem, but it is still useful
to describe it as CA.

~~~
HenryR
No, it's exactly the same. Availability is a guarantee that all requests are
eventually responded to within some time bound, whatever that is. During the
partition, availability is violated.

Therefore it's not a CA system, but a C system.

~~~
dadkins
Are you sure? Availability in the CAP theorem is a state, as are (P)artition
and (C)onsistency. Your system can't be simultaneously consistent and
available in the presence of a network partition. The A in CAP doesn't mean
always available. It just means the system can, at best, be any two of the
three at a time.

~~~
HenryR
No, it does mean always available - honestly :)

If there is some time period during which requests are not responded to within
a time bound, the system is not available then, and further is not a 'highly'
or 100% available system. That is what the CAP theorem is talking about.

Consistency, similarly, is not a state but a property that holds across all
responses. Either you return a consistent response to all your requests, or
you don't. In the context of CAP, there is no middle ground.

------
lusis
I guess I'm missing something because the concept of quorum deals with
partition tolerance. You require, to provide an answer, that X nodes agree on
the state of the data.

3 nodes, 2 must agree. When the partition heals, the resolution process
happens. It would have to be a SERIOUSLY bad network design and quorum setting
that allows a quorum on both sides of the split.
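The "no quorum on both sides" property is just majority arithmetic; a quick sanity check:

```python
def majority(n):
    # smallest strict majority of n nodes
    return n // 2 + 1

def both_sides_have_quorum(n, left, right):
    # two disjoint sides of a split cover at most n nodes in total,
    # so they can never both hold a strict majority
    assert left + right <= n
    return left >= majority(n) and right >= majority(n)
```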

It's just like eventual consistency. We're not talking days or even minutes.
We're talking milliseconds/seconds of partition split. If you have a partition
split for days, you have OTHER issues to address.

