

Dynamo Systems Work Too Hard - jamesmiller5
http://damienkatz.net/2013/05/dynamo_sure_works_hard.html

======
jbellis
This misses the point.

There are two main reasons why, when I was researching scalable databases, I
primarily gravitated towards Dynamo-style replication (Cassandra, Voldemort,
and at the time, Dynomite):

- There is no such thing as failover. Dynamo replication takes node failure
in stride. This is what you want for a robust system where "Network Partitions
are Rare, Server Failures are Not." Not only does it prevent temporary
unavailability during failover, it rules out an entire class of difficult
edge-case bugs (which every master-election-and-failover system out there has
been plagued with).

- It generalizes to multiple datacenters as easily as to multiple machines,
allowing local latencies for reads AND writes, in contrast to master-based
systems where you always have to hit the master (possibly cross-DC) for at
least writes. (Couchbase is unusual in that it apparently forces
read-from-master as well.) Cassandra has pushed this the farthest, allowing
you to choose synchronous replication to local replicas and asynchronous to
remote ones (a quick sketch follows below), for instance:
<http://www.datastax.com/docs/1.2/dml/data_consistency>
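
A minimal sketch of what that looks like in practice, assuming the DataStax
Python driver (the address, keyspace, and table here are made up):
LOCAL_QUORUM waits only for a quorum of replicas in the coordinator's
datacenter, while replication to remote datacenters proceeds asynchronously.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['10.0.0.1']).connect('app')  # illustrative contact point

    # Acknowledged by a local-DC quorum; remote DCs catch up asynchronously.
    write = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(write, (42, 'alice'))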

/Cassandra project chair

~~~
damienkatz
I'm not denying that Dynamo features more consistent availability, but it does
so at the cost of either temporal consistency or a much larger amount of
resources. That's the point of the article: the tradeoffs are expensive, and
you can achieve effectively the same performance and availability with fewer
resources.

Dynamo may rule out a certain class of bugs, but that doesn't mean other
systems must also have those bugs.

And Couchbase also has multi-datacenter master/master capabilities, with any
topology: chain, ring, hub-and-spoke, or any combination thereof.

~~~
strlen
> I'm not denying that Dynamo features more consistent availability, but it
> does so at the cost of either temporal consistency or a much larger amount
> of resources.

The trouble is that your article outlines problems with pretty much all quorum
systems, including multi-Paxos: the setup where there's an elected leader --
chosen by the first (prepare) phase of Paxos -- which then performs subsequent
writes using only the single-round-trip second (accept) phase.

Incidentally, that is very close to what you've proposed, except you've chosen
not to perform quorum reads. That is perfectly fine, but there are also hidden
costs -- not only do you need a leader election, but you also need
synchronization barriers. Before a node can handle writes or serve the latest
reads, it must back-fill its transaction log up to the last entry (which it
does by reading from peers).

So while you are saving on read traffic (online reads only go to the master),
you are now decreasing availability (contrary to your stated goal), and
increasing system complexity.

You also _do_ hurt performance by requiring all writes and reads to be
serialized through a single node: unless you plan to hold a leader election
_whenever_ the node fails to meet a read SLA (which is going to result in
disaster -- I am speaking from personal experience), you will have to accept
that you're bottlenecked by a single node. With a Dynamo-style quorum (for
either reads or writes), a single straggler will not drag down whole-cluster
latency.
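
To make the straggler point concrete, here is a toy sketch of an R-of-N
quorum read (no particular product's API -- the replica `get` method and the
timestamp-based resolution are assumptions): the read fans out to all N
replicas and completes as soon as R reply, so the slowest node stays off the
critical path.

    import concurrent.futures

    def quorum_read(replicas, key, r):
        # Fan out to all N replicas; return as soon as R have answered.
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
        futures = [pool.submit(rep.get, key) for rep in replicas]
        replies = []
        for fut in concurrent.futures.as_completed(futures):
            replies.append(fut.result())
            if len(replies) >= r:
                pool.shutdown(wait=False)  # don't wait for the stragglers
                # Assumed conflict resolution: newest version wins.
                return max(replies, key=lambda v: v.timestamp)
        raise RuntimeError("fewer than %d replicas responded" % r)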

The core point of Dynamo is low latency, availability, and handling of all
kinds of partitions: clean partitions (long-term single-node failures),
transient failures (garbage collection pauses, slow disks, network blips,
etc.), or even more complex dependent failures.

The reality, of course, is that availability is neither the sole nor the
principal concern of every system. It's perfectly fine to trade off
availability for other goals -- you just need to be aware of that trade-off.

You may want to note the evolution of Google's services: BigTable ->
MegaStore -> Spanner. Essentially, they started out with a non-HA system
(BigTable), found that every team began implementing HA on its own, then
added quorum-based protocols (Paxos), and finally added an optimization
(TrueTime) to reduce read latency.

~~~
IGuardBridges
One quick point -- AFAIK Megastore uses quorum-based replication across data
centers but uses BigTable within a data center, so master-based. Is that not
true? If it is, then your point does need to be qualified -- cross-DC failure
models are (I assume) different from those within a DC.

------
pbailis
There's at least one good reason for Dynamo's write-to-all and read-from-all
mechanism: latency.

What you've called 'W=2' in Couchbase is "write to master and at least one
slave." Dynamo-style 'W=2' means "write to any two replicas." This can
decrease tail latencies since you don't have to wait for the master--any two
will do; similarly for 'R=2'. Indeed, Dynamo 'W=2, R=2' will incur more read
load than master-based reads (at least double, but not necessarily triple, in
your figures). So I think it's more accurately a trade-off between latency and
server load.
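
To spell out the difference in toy code (the `put` interface and names are
made up, not either product's API): in the master-based version the master is
always on the write path, while in the Dynamo-style version any two acks
complete the write, so write latency tracks the second-fastest replica.

    import concurrent.futures

    def couchbase_style_w2(master, slaves, key, value):
        # "Master plus at least one slave": the master is always on the
        # critical path, so a slow master stalls every write.
        master.put(key, value)
        slaves[0].put(key, value)

    def dynamo_style_w2(replicas, key, value):
        # "Any two replicas": send to all N and return after the first
        # two acks, whichever replicas they come from.
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
        futures = [pool.submit(rep.put, key, value) for rep in replicas]
        for acks, fut in enumerate(concurrent.futures.as_completed(futures), 1):
            fut.result()
            if acks == 2:
                pool.shutdown(wait=False)  # remaining acks arrive in background
                return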

There can be big benefits to this redundant work. For example:
<http://www.bailis.org/blog/doing-redundant-work-to-speed-up-distributed-queries/>

But don't take it from me:
<http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext>

Anyway, I'm pretty sure CASSANDRA-4705
(<https://issues.apache.org/jira/browse/CASSANDRA-4705>), which allows for
Dean-style redundant requests, both decreases the read load (at least from the
factor of N in your post) _and_ should still reduce tail latency without
compromising on semantics.
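
For the curious, the "Dean-style" idea is roughly the following (an
illustrative sketch, not the actual CASSANDRA-4705 implementation; the names
and hedge budget are invented): try one replica first, and only send the
redundant request if the first hasn't answered within a small hedge delay, so
the extra read load is paid only on the slow tail.

    import concurrent.futures

    HEDGE_DELAY = 0.01  # seconds; assumed budget, e.g. a p99 latency target

    def hedged_read(primary, backup, key):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
        first = pool.submit(primary.get, key)
        try:
            # Fast path: the common case costs no extra read load.
            return first.result(timeout=HEDGE_DELAY)
        except concurrent.futures.TimeoutError:
            # Slow path: hedge with a redundant request; take whichever wins.
            second = pool.submit(backup.get, key)
            done, _ = concurrent.futures.wait(
                [first, second],
                return_when=concurrent.futures.FIRST_COMPLETED)
            pool.shutdown(wait=False)
            return next(iter(done)).result()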

I don't have skin in this game, but I'm pretty sure that the Dynamo engineers
had a good idea of what they were doing. (That said, the regular [non-
linearizable] semantics for R+W>N are sort of annoying compared to a master-
slave system, but can be fixed with write-backs.)

~~~
damienkatz
Good point. But writes are very fast: in our tests, write latency is less than
half of read latency, so we can easily do master-to-slave replication within
the SLA. But your point is correct: a Dynamo system is faster at achieving the
same replication factor.

~~~
jbellis
I think it's worth elaborating that the primary advantage of the Dynamo model
is not in the best or average case, but when everything does _not_ go as
planned -- when the master gets behind, when EC2 network latency spikes, etc.
Then "any two replicas" instead of "master plus one more" is much more robust.

------
gigq
This also exactly describes how HBase works. I've always preferred HBase to
Cassandra for this exact reason. You put far less read load on your servers
and you don't have to worry about most of the things on
<http://wiki.apache.org/cassandra/Operations>.

Another benefit that is not mentioned: with a master-based system you can
easily move responsibility for the data if a server starts to hotspot. In
Cassandra you have to use random key distribution, because if a server
hotspots, the only solution is to split the token ring, which is an intensive
operation that is hard to do while the server is under heavy load.

~~~
linuxhansl
Yep.

I hear the "master design is bad" argument all the time.

From many angles it _is_ a bad design, but from others it is not. First off,
it is simpler and easier to grok and to check for bugs. Debugging a running
system is easier.

It is also easier to implement different distribution strategies and
failure/placement groups, because those algorithms are centralized. If things
go wrong, it is easier to track where your data is.

No gossip rings to converge. And it is easier to grow the cluster.

/HBase committer here.

~~~
linuxhansl
Thanks for the down vote.

If you find any technical inconsistencies in what I said, please let me know.

------
gukjoon
Great article, Damien. The idea that network partitions are exceedingly rare
is the reason ElasticSearch goes CA rather than the AP that many other NoSQL
datastores choose.

<http://elasticsearch-users.115913.n3.nabble.com/CAP-theorem-tp891925p894234.html>

Not only are network partitions rare, but the most disastrous case, where the
cluster splits in half, is even rarer. Usually only a small part of the
cluster partitions away.

I hope people don't take this as a Dynamo vs. Couch discussion, because the
relative importance of partition tolerance is a topic that spans all
datastores that give up on ACID.

~~~
pkolaczk
Network partitions are extremely rare only for small clusters. In very large
or multi-datacenter clusters there is much more hardware than a single switch
between servers, and then the likelihood that something cuts off a whole room
full of servers from the rest of your cluster is not something safe to
neglect.

------
crb
When your units of networking concern are "availability zones" (i.e. data
centers) rather than just switches, wouldn't network failures now be more
common than server failures?

~~~
bscanlan
Agreed -- the assumption that the MTBF of a single network device is a
relevant statistic is undoubtedly incorrect anywhere but in a hobbyist's
network.

------
jeremiahjordan
Even if switch failures are rarer, Couch at W=1 will silently drop data on a
network partition and Dynamo at W=2 won't, so how is the comparison at the
end valid?

~~~
damienkatz
"...if the client wanted true multi-node durability, then the write wouldn't
have succeeded (the client would timeout waiting for replicas(s) to receive
the update) and the client wouldn't unknowingly lose data."

~~~
jeremiahjordan
But that isn't W=1

~~~
damienkatz
Ok, you're right. I just changed it to W=2, and all the numbers are the same.

------
pkolaczk
Why compare the MTBF of a single network switch to the MTBF of a node? Why
not compare it to the MTBF of a single CPU or a motherboard? Unless you're
talking about a hobby-size network, there is usually much more between the
nodes than a single network switch.

------
leef
Despite the name, you can't actually assume DynamoDB is based on the
architecture of the Dynamo paper.

~~~
bscanlan
"The original Dynamo design was based on a core set of strong distributed
systems principles resulting in an ultra-scalable and highly reliable database
system. Amazon DynamoDB, which is a new service, continues to build on these
principles, and also builds on our years of experience with running non-
relational databases and cloud services, such as Amazon SimpleDB and Amazon
S3, at scale."

<http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html>

