

Problems with CAP, and Yahoo’s little known NoSQL system - allertonm
http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html

======
allertonm
Attacking CAP seems to be a recent theme - Stonebraker did this a few weeks
back (http://cacm.acm.org/blogs/blog-cacm/83396-errors-in-database-systems-eventual-consistency-and-the-cap-theorem/fulltext)

Abadi's take seemed more interesting to me since it isn't rubbishing CAP as
such, but noting that C, A & P are not symmetrical and also that there is a
fourth consideration - latency - that most discussions of this topic do not
take into account even though it is clearly a design factor in some existing
NoSQL databases.

~~~
strlen
I think latency is the unwritten, default assumption in all of these cases.
High-latency, strongly consistent, shared-nothing distributed relational
databases are over two decades old (starting with Stonebraker's "The Case for
Shared Nothing"). The problem is that they're unusable for OLTP workloads
(without non-commodity hardware appliances using SSDs and connected over
InfiniBand, e.g., Exadata) even _if_ you don't take "CAP" into account.

Latency isn't a "tunable" variable in the same sense as C and P, in that
there are multiple ways to lower latency. With eventual consistency you can
use smaller quorums to lower the latency for reads, writes, or both. You can
also use strong consistency to avoid having to do quorums at _read time_, at
the cost of higher-latency writes. Higher-latency writes may be acceptable,
at least with 2PC. I am not sure many systems use Paxos for replicating a
transaction log on _every write_ due to the costs: I believe Scalaris may be,
but I can't find the information right now.
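
To make the latency knob concrete: a quorum operation completes once the k
fastest of the N replicas respond, so shrinking R or W directly shrinks the
latency you observe. A toy sketch in Python (the replica latencies here are
invented for illustration):

    import random

    def quorum_latency(replica_latencies, k):
        # A quorum op finishes when k replicas have answered, so its
        # latency is the k-th smallest of the per-replica latencies.
        return sorted(replica_latencies)[k - 1]

    random.seed(0)
    latencies_ms = [random.uniform(1, 50) for _ in range(3)]  # N = 3

    print(quorum_latency(latencies_ms, k=1))  # R=1: fastest replica wins
    print(quorum_latency(latencies_ms, k=2))  # R=2: wait for a majority
    print(quorum_latency(latencies_ms, k=3))  # W=N: bounded by the slowest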

------
sophacles
It seems that he ignores a certain necessity in his CA == CP premise.
Specifically, that premise requires that to achieve CA the system must be a
single logical unit. Once the network partitions, anyone in the correct
partition (the same one as the CA system) still has access, therefore the
system is still available. This sounds like bunk, because the system is not
available to those on the wrong partition; however, that is not an
availability issue, it is a partition issue -- pretty much by definition.

------
kingkilr
Wish my DB systems course had even covered this stuff. It's an interesting
take on it; however, in practice Cassandra does have a strong consistency
guarantee when you don't have a network partition, so I think the point he
tries to make about the C vs. A tradeoff is weaker in light of this.

~~~
strlen
The popular misconception about eventually consistent systems is that
eventual consistency means you don't get to read your writes. In reality,
however, eventual consistency means eventual consistency of individual
replicas: you can still get a consistent _view_ of the key space even if (for
a very short window of time) the individual replicas are inconsistent.

You can use quorum protocols and version vectors to get _guaranteed_ read-
your-writes consistency (when R + W > N). The cost is higher latency (for the
operations that require the quorums) and loss of availability during a network
partition (without additional heuristics such as "preferred R" or "preferred
W").

Even with _weak_ eventual consistency (when R + W < N), in a Dynamo-like
system not being able to read your writes is still only a failure condition
(it occurs when the first W nodes in the preference list for a key fail).

With traditional MySQL or Postgres/Slony replication, between master/slave or
master/master, your guarantees are actually _much lower_. You're not
guaranteed to be able to read your writes ( _if_ you use transactions you're
guaranteed W=1, else you're guaranteed W=0). Such a system is also "likely
consistent" rather than "eventually consistent": there is no guarantee that
consistency will eventually take place.
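
To illustrate that point: with asynchronous master/slave replication a write
is acknowledged before the slave has applied it, so a read routed to the
slave can miss your own write. A toy model (hypothetical class, not any real
replication protocol):

    class AsyncReplicatedPair:
        # Writes ack after the master alone (W=1); the slave catches
        # up "eventually", if ever.
        def __init__(self):
            self.master, self.slave, self.log = {}, {}, []

        def write(self, key, value):
            self.master[key] = value
            self.log.append((key, value))   # pending, not yet applied

        def read_from_slave(self, key):
            return self.slave.get(key)      # may lag behind the master

        def replicate(self):
            for key, value in self.log:
                self.slave[key] = value
            self.log.clear()

    db = AsyncReplicatedPair()
    db.write("x", 1)
    assert db.read_from_slave("x") is None  # stale: missed our own write
    db.replicate()
    assert db.read_from_slave("x") == 1     # consistent only after catch-up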

One thing to note is that PNUTS takes a hybrid approach: there's strong
consistency (2PC) on all the replicas within a "datacenter" (my unsupported
assertion is that this actually means "on all replicas on a single core
switch" rather than a datacenter per se), but eventual consistency between
"datacenters". This means that there's no need to do quorum reads or use
preference lists in a LAN environment: you can go to _any_ replica and get
the right value back, meaning you get read-your-writes consistency with lower
latency. Developers also don't need to worry about sending vector clocks
along during each put, and it's easier to implement mutations of
high-contention keys (in an eventually consistent system this would have to
be done using "optimistic locking" and vector clocks, e.g.
<http://project-voldemort.com/javadoc/all/voldemort/client/StoreClient.html#applyUpdate%28voldemort.client.UpdateAction%29>).
Having an agreed-upon transaction log helps too: there's no need to have
multiple consistency-repair mechanisms for nodes that have been temporarily
out of service (in Dynamo's case there are read-repair, hinted hand-off, and
Merkle trees): you just replay the delta in the transaction log (starting
from the latest checkpoint on the out-of-date replica and ending at the "high
water mark").

The apparent downsides are more expensive writes (2PC can be pricier than
simple quorum writes -- which is sometimes an acceptable trade-off), loss of
fault tolerance when the 2PC coordinator node crashes (in reality, this is
likely very rare), and higher demands on the network. Anything that can
create a partition in a "CA" LAN zone (e.g., network switches, Ethernet
cables) becomes a single point of failure (however, if you put all
Cassandra/Voldemort/Riak nodes on a single rack, behind a single switch,
you've got the same problem).

The "unwritten" downside, is complexity and lack of symmetry in the
implementation: different components require different hardware and software
profiles and the like, failure of one type of node is more costly than failure
of other.

That's not to say it's wrong, it's just a different set of trade-offs:
complexity of application development due to loss of replication transparency
(i.e., having to deal with vector clocks) vs. complexity of system
development and operations.

Note that if you read the BigTable paper from Google you'll see the same
story. Google File System _itself_ is actually eventually consistent (which
is fairly common, e.g., CodaFS). BigTable itself, however, is strongly
consistent: Chubby (which employs Paxos, a form of 3PC) ensures that requests
are routed to specific tablet servers (which then checkpoint to GFS) and that
only one master is active at a time. Note that Paxos is _not_ used to
explicitly agree on a replicated transaction log (the "atomic, totally
ordered multicast" problem); my own guess is that this would simply be too
expensive.

Note that in BigTable routing information is consistent across the entire
cluster, unlike in a Dynamo system, which uses Gossip for routing-table
propagation (the advantage being that you only need to specify a small number
of "seeds" when connecting to/joining a cluster; the disadvantages being not
_instantly_ seeing dead nodes -- needing additional failure detection -- and
potential scalability issues with Gossip on clusters of >1000 machines).
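
A minimal push-gossip sketch (invented data shapes, not Cassandra's actual
protocol) showing why propagation takes multiple rounds rather than being
instant:

    import random

    def gossip_round(views):
        # Each node pushes its routing view to one random peer, which
        # keeps the higher-versioned entry for each key.
        for i, view in enumerate(views):
            peer = views[random.choice([j for j in range(len(views)) if j != i])]
            for key, (version, value) in view.items():
                if key not in peer or peer[key][0] < version:
                    peer[key] = (version, value)

    # A new entry spreads in O(log N) expected rounds, which is why a
    # state change (e.g. a dead node) is not seen instantly by everyone.
    views = [{} for _ in range(8)]
    views[0]["node-0"] = (1, "alive")
    rounds = 0
    while any("node-0" not in v for v in views):
        gossip_round(views)
        rounds += 1
    print("converged after", rounds, "rounds")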

When a single BigTable cell was insufficient, multiple BigTable cells were
set up with eventually consistent replication. Unfortunately there's no
public paper (yet) on Spanner, but my bet is that Spanner is built on that
same pattern (an eventually consistent layer on top of BigTable).

It's also worth reading Google's "Paxos Made Live", which shows the
complexity of implementing Paxos: it's a beautiful algorithm, but with many
corner cases. Fortunately ZooKeeper (which provides Chubby-like
functionality, e.g., distributed locks and "ephemeral nodes", and uses its
own form of 3PC -- ZAB) is available and is open source.

Dynamo, PNUTS, and BigTable all have another unsaid assumption: if you want
technology to be an advantage, hire top talent, e.g., Vogels, Dean, Ghemawat.
Most specifically, the sort of top talent that can take a "business problem"
and boil it down to a form where it's solvable by elegant data structures and
algorithms, deciding which trade-offs make sense.

------
moonpolysoft
I remember the Yahoo guys saying they would open source PNUTS/Sherpa at the
very first NoSQL meetup. It's a shame that it's been so long and they haven't
followed through on that.

~~~
strlen
They open sourced the Traffic Server, which is responsible for request
routing. However, Traffic Server came from an acquisition (Inktomi) and was
always intended as a commercial product that could run on other people's
systems.

I'd wager that the real issue is dependency on some of the core Yahoo
libraries: there are a whole number of them, and they're often so widely used
that a lot of ex-Yahoo engineers are surprised to discover they're not open
source. The reason they're not open source is that some of them would not be
useful outside of Yahoo and would create maintenance problems, e.g., features
that were once missing in FreeBSD, are available in recent versions of
FreeBSD as well as Linux, but had to be ported forward to preserve API
compatibility.

