The CAP FAQ (github.com/henryr)
55 points by sytelus on Oct 18, 2015 | hide | past | favorite | 15 comments


What I have often wondered is how "distributed" systems are different from systems in the small, where different components may be thought of as nodes on a network. In other words, why aren't multi-threaded apps writing to the same database subject to the same constraints, if the components of the system are not provably 100% reliable? Is it just a practical matter, that communication between a host and external storage is so much more reliable than communication across larger distances that we don't have to worry about it?


The only difference really is single-point-of-failure. Multiple threads writing to the same DB record can be assisted with transactions or locking - but again, the DB is, in the end, absolutely authoritative. What it says goes, and if it's unavailable, so is everything.

When solving for high Availability through a distributed cluster, you might give up immediate Consistency. But partitions can end up giving different answers (authoritatively, even) and an unavailable master record doesn't mean something won't answer as if it _were_ authoritative.

A single node DB - like a MySQL server - still has failure points, but none that plague CAP systems (well, except Availability, but that's how we got into this mess to begin with).


The thing that really struck me is that it was proven only in 2002, and yet it is by now a fundamental idea in (my) CS education about Databases and Distributed Systems: I'm used to considering its implications when thinking about a problem.

So I wonder: how did you do it before then?


Two things.

1. The notion of "high availability" wasn't used much before "web scale" applications.

2. As others have said, the CAP theorem is kind of tautological and does more to confuse what's possible than to clarify it. For example, the CAP theorem implies that even eventual consistency is impossible in the case of partitions. If partitions can be of infinite length then no kind of consistency is possible; if partitions are finite, then all kinds of consistency are possible (though "availability" then means very high latency). In short, the CAP theorem itself doesn't contribute much information to the analysis of distribution problems.

I recommend reading this nice overview of the issues with CAP: http://arxiv.org/abs/1509.05393
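The "finite partitions make every kind of consistency possible, at the cost of very high latency" point can be sketched as a toy simulation (all class names, timings, and the retry policy here are made up for illustration):

```python
class Replica:
    """Toy replica that is unreachable while a network partition lasts."""
    def __init__(self, partition_ends_at):
        self.partition_ends_at = partition_ends_at
        self.value = None

    def try_write(self, value, now):
        if now < self.partition_ends_at:
            return False  # still partitioned: the write can't get through
        self.value = value
        return True

def write_until_healed(replica, value, clock):
    """Retry until the (finite) partition heals: 'available', but only
    in the sense of arbitrarily high latency."""
    attempts = 0
    while not replica.try_write(value, clock()):
        attempts += 1
    return attempts

ticks = iter(range(100))         # simulated clock: one tick per call
clock = lambda: next(ticks)

replica = Replica(partition_ends_at=5)
attempts = write_until_healed(replica, "x", clock)
# The write succeeds only once the partition ends, after 5 failed tries.
```

If `partition_ends_at` were infinite, the loop would never terminate, which is the "no kind of consistency is possible" case.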


"High availability" was used very often long before "web scale" anything came along. It was in the very name of HACMP/6000, which I worked on in 1992, and wasn't new then. The problem is that "availability" in that world meant something very different than "availability" as it's defined in CAP - system availability vs. node availability. That has caused almost as much confusion as the difference in meaning between "consistency" in CAP and "consistency" in ACID.


Before then, people built distributed systems and (sometimes) assumed they didn't have failure modes.

Now, people build distributed systems and assume they don't have failure modes AND THEN post about how they've "refuted the CAP theorem". :)


The same as anything, really. You try to write a system, you realise that logically it was harder than you first assumed, and you note the failure scenarios that you're aware of. You mitigate and ship.

Later, as the bug reports come tumbling in, you realise there were failure scenarios that you weren't aware of, usually because you were relying on something that you can't actually rely on.

The fact that this is formalised is great, because it means people who don't talk about when their systems don't work aren't giving you the whole picture.


I would say because ultimately it is just a shorthand/buzzword/brand. The theorem itself is kind of tautological, but it provides a simple starting point for talking about trade-offs.

Unfortunately, and inevitably, things are a bit more complicated than a model simplified to 3 words, so it has also generated its fair share of confusion.


I don't think you can blame the theorem for the confusion. It's the trade-offs that are murky and complicated; CAP just tells us that there's a need to care about the dirty details.

Anyway being able to use CAP to beat down the developer who wants to use the new webscale shiny that doesn't even begin to discuss or document tradeoffs is a net win in my opinion.


The proof is questionable. Read here: http://markburgess.org/blog_cap.html


This is well-explained and fairly concise.

It's a concept that could be much more rapidly explained in an intuitive way with a few interactive diagrams, or honestly a few animated gifs, though.

The CAP theorem is one of those things that's a simple idea, but I didn't understand it the first few times I ran across it just because:

1) the terminology is weird ("partition", huh... well, I know how to partition a disk, but this clearly isn't that)

2) broad concepts need good examples for newcomers -- to bridge the gap from familiar to unfamiliar. If you've worked with large scale distributed processing before, great, but these ideas are especially crucial for people who are going to need such systems but don't yet understand the tradeoffs.


I think CAP is really more about how much consistency and availability you need. This FAQ talks about "relaxing" and that's the right word.

Some systems tolerate that a database (for example) is unavailable for a couple of seconds. Some don't.

When you add a value to a distributed database, how long are you prepared to wait for it to propagate to replicas? In some cases it's fine to wait several minutes (think DNS); in others it can have catastrophic results (think finance/banking use cases).
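That propagation-wait tradeoff can be sketched as a toy quorum write (the helper function and latencies below are hypothetical, purely to illustrate how waiting for more replica acks buys stronger read guarantees at the cost of latency):

```python
def quorum_write(n_replicas, w, ack_latencies_ms):
    """Hypothetical helper: a write 'commits' once the w-th fastest of
    n replicas has acknowledged it. With a read quorum r such that
    r + w > n, any quorum read is guaranteed to see the write - but the
    commit latency is the w-th ack, not the first."""
    assert 1 <= w <= n_replicas == len(ack_latencies_ms)
    return sorted(ack_latencies_ms)[w - 1]

# Made-up per-replica ack latencies in milliseconds.
latencies = [3, 120, 7, 45, 9]

fast = quorum_write(5, 1, latencies)      # ack from any one replica: 3 ms
quorum = quorum_write(5, 3, latencies)    # majority: 9 ms
all_acks = quorum_write(5, 5, latencies)  # wait for everyone: 120 ms
```

The DNS case is `w = 1` with reconciliation later; the banking case leans toward the majority or all-replicas end.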


Ironically finance has coped just fine with delayed-update replicas for centuries; that's what cheque clearing is. You issue a 'write' on a piece of paper, speculatively commit a transaction locally, but in the event of a write collision at head office it gets 'bounced' and the transaction rolled back.
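The cheque-clearing pattern is essentially an optimistic local commit with later reconciliation; a toy sketch (these classes and balances are invented for illustration, not any real banking protocol):

```python
class HeadOffice:
    """Authoritative record; the only place conflicts are resolved."""
    def __init__(self, balances):
        self.balances = dict(balances)

    def debit(self, account, amount):
        if self.balances.get(account, 0) >= amount:
            self.balances[account] -= amount
            return True
        return False

class Branch:
    """Speculatively accepts cheques locally, reconciles with head
    office later; conflicting writes bounce and are rolled back."""
    def __init__(self, head_office):
        self.head_office = head_office
        self.pending = []

    def write_cheque(self, account, amount):
        self.pending.append((account, amount))  # speculative local commit

    def clear(self):
        bounced = [(acct, amt) for acct, amt in self.pending
                   if not self.head_office.debit(acct, amt)]
        self.pending = []
        return bounced

ho = HeadOffice({"alice": 100})
b1, b2 = Branch(ho), Branch(ho)
b1.write_cheque("alice", 80)  # both branches accept a cheque...
b2.write_cheque("alice", 80)  # ...against the same balance
bounced = b1.clear() + b2.clear()
# One cheque clears; the write collision bounces the other.
```

Both branches stayed "available" during the partition from head office; consistency was restored later, with one transaction rolled back.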


Looking at this article IMO the best example of using CAP is Cassandra. The user chooses the tradeoff.


Most times the tradeoff has been chosen before the database is.

