
Dynamo: A flawed architecture  - ypavan
http://jsensarma.com/blog/2009/11/dynamo-a-flawed-architecture-part-i/
======
werner
Dynamo is still in use, but as with all technologies that have to operate at
Amazon's scale, the systems evolve rapidly. The storage systems in use now no
longer look like the ones from 5 years ago when Dynamo was developed.

The Dynamo SOSP paper had two goals: 1) show how systems are a composition of
techniques and how all of these need to work together build a production
system 2) given that it was based on a variety of research results it was
intended to give feedback to the academic community about the difficulties of
moving from research results to production, and what matters in real-life vs
production.

The paper was never intended to be a complete blueprint for easy design of
follow-up systems.There just isn't enough room in an academic paper to do
justice to that. If it was going to have any role in that the best we could
hope for was as a collection of points you would have to think hard about and
make decisions about when you were going to design your own storage engine.

That doesn't means I agree with the conclusions of the analysis, on the
contrary, I think they are seriously flawed as well (as with everything else
that is not absolutely perfect :-)). But I can understand that when people
look for the dynamo paper to be a blueprint that solves all their storage
needs and provides a perfect available service under all failure scenarios
(and solves world peace), that they may be left with a few questions
afterwards.

I always thought that the real contribution of the paper was that it made you
think hard about the trade-offs you are faced with when you have to design
high-available, ultra-scalable systems that are cost-effective and provide
guaranteed performance. 5 years later we know a lot more but this stuff is
still hard, and we still need to balance rigorous principles with production
magic to make it work. But you do need to fully understand the principles
before you can make the production trade offs.

Caveat: some of these remarks were tongue-in-cheek fun; I leave it to the
reader which ones :-)

------
jbellis
Basically, the author doesn't understand quorum protocols and draws a bunch of
erroneous conclusions because of that.

Here's the core misunderstanding:

"It is hinted that by setting the number of reads (R) and number of writes (W)
to be more than the total number of replicas (N) (ie. R+W>N) - one gets
consistent data back on reads. This is flat out misleading. On close analysis
one observes that there are no barriers to joining a quorum group (for a set
of keys). Nodes may fail, miss out on many many updates and then rejoin the
cluster - but are admitted back to the quorum group without any
resynchronization barrier."

That's because adding a barrier (a) isn't necessary for the consistency
guarantee, (b) doesn't add any extra safety in the worst case, and (c) adds
complexity (always be suspicious of complexity!)

Consider the simple case of a 3-node cluster (A, B, C) with N=3.

For quorum reads and writes, R = W = 2.

Then reads will be consistent if any two nodes are up. It doesn't matter if
nodes A and B are up for the write, then B and C are up for the read -- the
reader will still always see the latest write.

Of course you will have to block writes and reads if more than one node is
down for either operation. This is what the dynamo paper means when it talks
about allowing clients to decide how much availability to trade for
consistency.

The rest of the article is basically variations of this theme. (The only way
to provide "strong consistency is to read from all the replicas all the
time"!? Wrong, wrong, wrong.)

~~~
cx01
I think you're missing the fact that Dynamo uses sloppy quorums (see section
4.6 in the paper), i.e. writes don't have to happen on the first N members of
the preference list. For example, if W=3 and the first 3 nodes in the
preference list are down, the write would be taken by the next 3 nodes in the
preference list. Now if the first 3 nodes come back online, the system would
be in an inconsistent state.

~~~
jbellis
The Dynamo paper is very information-dense, so you have to do a little reading
between the lines. Here's the relevant quote from 4.6:

"Using hinted handoff, Dynamo ensures that the read and write operations are
not failed due to temporary node or network failures. Applications that need
the highest level of availability can set W to 1, which ensures that a write
is accepted as long as a single node in the system has durably written the key
it to its local store."

So, if W < N, then N - W writes can be written using hinted handoff to nodes
other than the final destination, and forwarded when the natural nodes come
back online. BUT the W writes specified by the requested consistency level do
have to be to the right nodes or everything falls apart.

This is the implementation that makes the most sense when taken with the rest
of the system design, so even though it's not quite spelled out I'm pretty
sure that's how it works. (I _am_ sure that's how Cassandra's consistency
levels work, as possibly the highest-profile eventually consistent open source
database. :)

~~~
cx01
> BUT the W writes specified by the requested consistency level do have to be
> to the right nodes or everything falls apart.

The paper doesn't say this. The passage you cited explicitly says that if W=1
any node in the whole system could take the write, contradicting your
statement.

Even then, it wouldn't make sense to apply your rule. If W writes have to
happen on the first N nodes in the preference list, then I wouldn't need
sloppy quorums at all.

~~~
jbellis
> any node in the whole system could take the write

No, it says "a single node," not "any node." The distinction is important: I'm
saying the former means "a single [real destination] node."

> If W writes have to happen on the first N nodes in the preference list, then
> I wouldn't need sloppy quorums at all.

Hinted handoff is there to make "eventually" happen faster for clients who
_don't_ specify R + W > N. Not to magically make W not matter for clients who
do, because clearly that doesn't work. (And if the authors didn't understand
that, why bother discussing W at all? it doesn't make sense.)

~~~
cx01
> No, it says "a single node," not "any node." The distinction is important:
> I'm saying the former means "a single [real destination] node."

But the paper explicitly says: "Applications that need the highest level of
availability can set W to 1, which ensures that a write is accepted as long as
a single node in the system has durably written the key it to its local store.
Thus, the write request is only rejected if all nodes in the system are
unavailable.". The last sentence makes clear that they mean any single node in
the whole system.

Additionally the paper states: "all read and write operations are performed on
the first N healthy nodes from the preference list, which may not always be
the first N nodes encountered while walking the consistent hashing ring.".
There is no mention that at least W of the first N nodes have to accept a
write.

~~~
jbellis
You're missing the forest for the trees. You're taking a pedantically literal
view of one sentence and saying "see? it can't work!" when that interpretation
of the sentence in question makes no sense at all when taken with the paper as
a whole.

If you want to keep arguing that Dynamo's HH ignores the W value they spend a
lot of effort explaining, fine, maybe you're right; maybe Dynamo's authors
really are a bunch of idiots. Nobody outside Amazon can say for sure.

But Cassandra doesn't ignore W, and neither does any other actual system I
know of: HH is just for the N - W nodes. Because that's the only way to
implement it that makes sense. :)

------
mbrubeck
[Disclosure: I used to work at Amazon, but not on Dynamo or Cart, and have no
non-public information about those systems.]

I think this is based on a flawed application of metrics from other
organizations that do not match Amazon's actual business needs.

The author talks about bank transaction systems with five nines of
availability. Sure, you can build a centralized system with a critical core
that has near-zero downtime - usually at great operational cost - but there's
not a bank in the world that's actually 99.999% available _to customers, over
the internet_. Most retail banks I've used take several hours of outages _per
month_ for maintenance on their web services.

Amazon wants its systems not just to keep running but to keep taking orders
from millions of people all over the world. This means that they are concerned
with the reliability not just of a storage server, but everything needed to
connect it to its application servers and to customers. They have found that
the cost-effective way to do this is to distribute every component across
geographically separated datacenters. Amazon has remained available through
real and simulated datacenter-wide outages ranging from power/cooling failures
to floods, fire, and hurricanes. No Amazon system lives within a single
building, much less a single network switch or rack.

Finally, although the formally provable aspects of Dynamo's "eventual
consistency" guarantee may be vague, any team operating a distributed system
will study and understand the actual operational characteristics in practice,
under both normal conditions and various failure modes. Some systems may have
realistic failures that lead to days or hours of inconsistency (in which case
the team has deliberately chosen this and will write client software to be
aware of it); others might be tuned to achieve consistency within milliseconds
under normal operation and to set off alarms within seconds after a failure.
I've never known any that would be used in circumstances where "human
lifetimes" are a relevant time period.

------
HenryR
I don't agree with all his conclusions. In particular, it seems as though he
is saying 'eventual consistency is no practical good' and then beating Dynamo
for a few paragraphs with the same stick.

The most often quoted example of Dynamo's use is the shopping cart application
on every Amazon page. In the worst case, your shopping cart will mysteriously
empty itself. This is a huge pain, and a potential loss for Amazon, but it's
not catastrophic in the way that is implied here. Indeed, assuming the
liveness of a quorum, the application will read back all conflicting entries
for the shopping cart (those that aren't ordered under their vector clock
timestamps) and the onus is on it to merge the conflicts. Of course, the
shopping cart will take the union of all updates to ensure that nothing is
dropped (and therefore some delete operations may be lost).

The key point is that some applications can do without observing a
linearisable history, and the interest of this paper is that it explores the
design space if you drop that requirement.

I don't understand the post's points about CAP; all three requirements are in
tension. Dynamo is unusual in that it is live in the case of a network
partition while still maintaining its consistency guarantees.

Similarly - those systems that use chain-replication asynchronously like he
describes can still suffer from the same read-old-value-after-it-was-written
consistency issue, if the reader jumps between two replicas for consecutive
reads. Avoiding that can require synchronous coordination of updates (a la
Paxos, e.g.) which is, I think, what the paper is driving at. Otherwise, there
are still failure modes which, in order to patch up, require stronger
guarantees about liveness of quorums than Dynamo needs.

I understand that Dynamo is no longer used internally at Amazon at scale, so
maybe some of the practical points this post makes about the realities of
central coordination held water for real deployments. Still, I don't buy the
reaction that prioritising availability uber alles and designing a system that
does not behave exactly like a strongly-consistent key-value store immediately
invalidates it for workloads that have high availability requirements and
lower consistency needs.

~~~
gruseom
_Dynamo is no longer used internally at Amazon at scale_

That's certainly relevant. I'd love to know more. Do you know why? What are
they doing instead?

~~~
HenryR
No - all I know is based on rumour.

------
ynniv
Author somewhat intelligently argues that Dynamo sacrifices more consistency
than it gains availability. Part of his argument is the difficulty of truly
leveraging partitioning to gain availability, mostly the practical issues of
users finding the currently available nodes in such a system. Another is
issues arising from nodes re-joining a cluster without first performing a
synchronization. He seems to favor the BigTable family, or simple master-slave
replication (assuming that the write-load is low enough for this to be
acceptable).

~~~
moonpolysoft
Except that he leaves out vector clocks and read repair which are mechanisms
critical to fixing consistency issues in a dynamo style system.

Also his treatment of partition tolerance assumes that operations set up the
Dynamo cluster machines incorrectly; all in the same rack.

He also completely glosses over merkle trees, assuming that the sync they help
perform must be a heavy operation.

------
jsensarma
@everyone - thanks for all the comments. i have incorporated the multiple
comments about the data loss scenario being incorrect when using vector
clocks. i have been thinking too much about Cassandra lately - and it doesn't
use vector clocks. that said - i continue to believe that returning stale
reads is bad and best avoided and that unbounded stale-ness is not acceptable
for many applications. To the extent that this is an avoidable scenario in a
tightly coupled environment within a single data center - i consider it to be
a significant drawback.

@jbellis - i hope the responses to your comment have convinced you about the
problems with the dynamo quorum scheme/read-write protocols. i can convey that
the problems i described definitely do exist in Cassandra. Jun Rao has made
this point in the Cassandra public mailing lists a long time back and i have a
pointer to the JIRA that he filed in my post as well.

regarding Dynamo being an interesting design space and for the academic
community etc. That may very well have been the case - but the reality of the
situation is that the world is now overflowing the Dynamo clones with people
considering them for all kinds of usages. Hey - if it was good for Amazon (and
Facebook and LinkedIn) - it's probably good for me! The people trying to use
Dynamo clones do not understand all the small details. They don't understand
what applications would be safe to write on it and which would not. I hope my
posts (imperfect and opinionated undoubtedly) would provide a counterpoint to
this sentiment and make users think harder and deeper before they make the
leap.

i was also hoping to trigger a discussion (looks like i succeeded). i hope
that it takes us to a better design space than what exists currently.

------
kvs
@Werner: Darn, someone figured out that Dynamo is a flawed architecture.
Luckily its only use is storing hundreds of millions of shopping carts :-)

------
codeslinger
Well, this guy works at Facebook on Hive. If there's no hope for him, how much
hope do the rest of us have ;-)

------
codeslinger
Also, Cassandra is a Dynamo clone? Someone should tell them that...

------
evgen
Abridged version: CAP is hard/scary and eventual consistency is too
complicated. Let's go shopping.

~~~
nl
Let's all go shopping, eventually.

------
tybris
If by flawed you mean scalable.

