

How Eventual is Eventual Consistency?  - pharkmillups
http://basho.com/blog/technical/2012/03/02/Eventual-Consistency-and-PBS/

======
pbailis
Shameless plug:

Interactive Demo: <http://bailis.org/projects/pbs/#demo>

PBS for Cassandra: <https://github.com/pbailis/cassandra-pbs>

~~~
jbellis
And an explanation of the PBS paper as applied to Cassandra:
[http://www.datastax.com/dev/blog/your-ideal-performance-cons...](http://www.datastax.com/dev/blog/your-ideal-performance-consistency-tradeoff)

------
cperciva
Good talk, but disappointing that he doesn't mention the issue of node
failures: With W=1 and a single node failure at the wrong time, you don't have
eventual consistency any more... you have data loss.

~~~
pbailis
This depends. Even with W=R=1, you still send each read/write to every
replica. If the replica that acknowledged the write fails, the other two
replicas should eventually receive the write.

But--what happens if the write messages to the other two replicas drop? Well,
typically the coordinator will hand the write off to replacement replicas
(hinted handoff/sloppy quorums). Alternatively, when you detect the message
loss via a timeout you could retry. But what happens if the coordinator dies
and can't retry? Then you'd better replicate your coordinator. That gets
hairy, so hinted handoff and sloppy quorums are easier. However, whichever way
you go, it's definitely possible to handle an arbitrary (still fixed) number
of node failures without data loss.

In general, though, sloppy quorums/hinted handoff solve this problem. I
haven't heard any data loss complaints with Riak/Cassandra/Voldemort due to
replication, but I'm very interested to hear if you have.
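For illustration, here is a toy sketch of that sloppy-quorum write path with hinted handoff. All names here (`REPLICAS`, `FALLBACKS`, the hint tuple format) are invented; real systems like Riak and Cassandra do this with failure detectors and background hint replay:

```python
REPLICAS = ["r1", "r2", "r3"]   # preferred replicas for this key
FALLBACKS = ["f1", "f2"]        # hypothetical standby nodes for hints

def sloppy_quorum_write(value, alive, w=1):
    """Coordinator performing a W=1 write with hinted handoff.

    `alive` is the set of currently reachable nodes. The write goes to
    every reachable preferred replica; for each unreachable replica, a
    "hint" is parked on a fallback node, to be replayed when the
    replica recovers, so the write is not lost.
    """
    acks, hints = [], []
    fallbacks = iter(FALLBACKS)
    for replica in REPLICAS:
        if replica in alive:
            acks.append(replica)  # replica durably stores the value
        else:
            hints.append((next(fallbacks), replica, value))
    if len(acks) < w:
        raise RuntimeError("write failed: fewer than W acks")
    return acks, hints

# One preferred replica (r2) is down: the write still succeeds with
# W=1, and a hint is queued so r2 eventually receives the value.
acks, hints = sloppy_quorum_write("v1", alive={"r1", "r3"})
# acks == ["r1", "r3"]; hints == [("f1", "r2", "v1")]
```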

You can definitely extend PBS to node failure cases/sloppy quorums/hinted
handoff. The main reason we didn't was because we don't have good failure
data. There's nothing stopping you, and, as we point out in the paper/backup
slides, you can potentially hide this in the tail (e.g. the .01% of stale
cases) provided your DB "does the right thing".

~~~
cperciva
Indeed, if your coordinator isn't a replica you obviously need it to fail as
well in order to get data loss -- I was thinking of the symmetric case where
every replica is a coordinator and W=1 reduces to "store locally, broadcast,
and send an ACK".

And yes, of course there are hairy ways to solve the problem; I just would
have liked to see it mentioned so that people realize that it exists.

~~~
pbailis
You make a good point. Without additional safeguards (and depending on the
implementation), W=1 can indeed mean "durability of one".

This also depends on your failure model. If your node crashed (RAM cleared)
and your durable storage broke, you're in trouble. If the data was durably
persisted and you just have to restart the node, it's a better situation.

------
opendomain
Great presentation, Mark! This is what I LOVE about Basho: they use science
and metrics to show how Riak compares to other NoSQL stores, rather than
marketing.

TL;DR version: in any database, you have to choose which 2 parts of CAP you
want: Consistency, Availability, Partition tolerance. If you choose
Availability, just HOW consistent is my data? If I want my data to be more
consistent, how does that affect my availability? This gives an actual
formula for calculating the design that best meets your application's goals.
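A minimal sketch of one such calculation, assuming the simplest case from the PBS work: read and write quorums drawn uniformly at random from N replicas, with no propagation delay (the function name is mine):

```python
from math import comb

def p_consistent(n, r, w):
    """Probability that a read quorum of size R intersects a write
    quorum of size W, when both are drawn uniformly from N replicas.
    P(stale) = C(N-W, R) / C(N, R), so P(consistent) is its complement."""
    return 1 - comb(n - w, r) / comb(n, r)

# N=3 replicas: R=W=1 gives only a 1/3 chance of reading the latest
# write immediately, while R=W=2 guarantees quorum overlap.
p_consistent(3, 1, 1)  # 1/3
p_consistent(3, 2, 2)  # 1.0
```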

------
victork2
I can't see the talk at the moment, but here are my five cents on the subject:

The main problem with eventual consistency is not how often inconsistency
happens; it is: what damage will it do WHEN it happens?

Imagine you're a bank handling big clients, and you lose track of a write of
$50+ million. Where did the money go? How do you differentiate that from a
fraud attempt?

If you have customers, how will you tell them that you just lost,
probabilistically speaking, "one in 10 million packages"?

But that's a very interesting question with philosophical repercussions, too:
how did we end up in a society that did not build systems that accept a
certain degree of failure?

~~~
pbailis
Eventual consistency isn't about "losing writes". It's about how long it will
take for all of your replicas to agree on/observe the last written versions
and, in the meantime, you'll read potentially stale data.

Certain data structures inherently tolerate staleness or message reordering:
look at your Twitter feed, any kind of log, or other "commutative data
structures". If you can't handle staleness, you should probably use stronger
consistency.

However, if you can find out about staleness after the fact (an asynchronous
callback, for instance), you can run some sort of compensatory routine (e.g.,
overdraft charges for your bank). Then you have an optimization problem: (cost
of compensation)*(number of times you have to run compensation) vs. the
benefit you get from weak consistency (latency, availability, whatever).
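As a back-of-the-envelope sketch of that optimization problem (the function name and all numbers below are invented for illustration):

```python
def net_benefit(p_stale, n_ops, compensation_cost, benefit_per_op):
    """Weak consistency pays off when its per-operation benefit
    (latency, availability) exceeds the expected cost of running
    the compensatory routine for each stale read."""
    expected_compensations = p_stale * n_ops
    return n_ops * benefit_per_op - expected_compensations * compensation_cost

# e.g. 1M operations, 0.01% stale reads, $5 to compensate each one,
# and $0.001 of benefit per operation from the faster, weaker path:
net_benefit(0.0001, 1_000_000, 5.0, 0.001)  # positive: weak consistency wins
```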

There's an awesome paper by Pat Helland about the problem you mention
regarding building real-world systems on top of inherently unreliable
components. It's called "Building on Quicksand":
<http://arxiv.org/pdf/0909.1788.pdf>

~~~
victork2
Hmm, I have been downvoted, but I deserve it. So, to rephrase: the problem
is, as you say, not losing data that is going to be inserted, but performing
operations on the wrong (stale) data, which means errors.

Let's say I have 3 pieces of data required by a function to compute the
outcome of a certain operation (withdraw 10 billion dollars): A, B, C. We
change the data in this fashion: A -> A', B -> B', C -> C' -> C".

When I query, because of eventual consistency, f(A,B,C) may very well be:

f(A,B,C), f(A,B',C), f(A,B,C'), f(A,B,C"), and so on. It is simple when you
have 3 sources, but what about when you have 50, and the operation uses 50 of
those f, each depending on 50 other pieces of data?
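The mixed-version reads described above can be sketched directly (the replica contents are made up; in a real store each replica would converge eventually):

```python
# Two replicas that haven't yet converged: each holds some mix of old
# and new versions of A, B, and C.
replica_1 = {"A": "A'", "B": "B",  "C": "C'"}
replica_2 = {"A": "A",  "B": "B'", "C": 'C"'}

def read(key, replica):
    # An eventually consistent read: returns whatever this replica has now.
    return replica[key]

# One computation can observe a mix of old and new versions drawn from
# different points in time:
snapshot = (read("A", replica_1), read("B", replica_1), read("C", replica_2))
# snapshot == ("A'", "B", 'C"'): new A, stale B, newest C
```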

Anyway, again sorry for my poor explanation of the issue!

