
Why Cassandra Doesn't Need Vector Clocks - nickmbailey
http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks
======
seancribbs
Since they're not likely to approve my comment on their blog, here's what I
said:

"Way to misrepresent[1] vector clock usage in Riak! LWW deliberately ignores
the vector clock. No one would use that in production without a strong
assurance that they will never have concurrent writes. Also note that later in
the post[2] Kyle shows how using them properly leads to zero data-loss.

[1]
[https://yourlogicalfallacyis.com/strawman](https://yourlogicalfallacyis.com/strawman)
[2] [http://aphyr.com/posts/285-call-me-maybe-riak](http://aphyr.com/posts/285-call-me-maybe-riak)

To add on,

> Cassandra addresses the problem that vector clocks were designed to solve by
> breaking up documents/objects/rows into units of data that can be updated
> and merged independently. This allows Cassandra to offer improved
> performance and simpler application design.

This is not solving the problem vector clocks solve; it is punting on the
resolution issue. LWW partial updates may well yield greater performance, but
performance is the only problem they solve.

Listen to or watch
[http://thinkdistributed.io/blog/2012/08/28/causality.html](http://thinkdistributed.io/blog/2012/08/28/causality.html)

To be fair, both designs are valid choices, but jbellis should be honest about
his tradeoffs and not simply cast aspersions on other valid designs because
they aren't the one that C* chose.

~~~
aphyr
I concur; this is punting on the resolution problem.

As far as I can determine in testing with Jepsen, there are no cases where one
can safely (e.g. in a way which guarantees some causal connection of your
write to a future state of the system) update a cell in Cassandra without a
strong timestamp coordinator: either an external system like Zookeeper, or
Cassandra 2.0 paxos transactions.

Most of the production users and datastax employees I've spoken with recommend
structuring your data as a write-once immutable log of operations, and to
merge conflicts on read. This is _exactly_ the same problem that Riak siblings
have--except that you don't get the benefit of automatic pruning of old
versions, because there's no causality tracking. GC is up to the user.
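
That append-and-merge pattern is easy to sketch. Below is a minimal model
(hypothetical names, not any real client API): every write lands under a fresh
unique id, reads fold all versions through a user-supplied resolver, and
pruning old versions is left entirely to the caller, as described above.

```python
import uuid

class AppendLogStore:
    """Toy model of the write-once pattern: every update is a new
    immutable version keyed by a unique id; conflicts are merged on
    read by a user-supplied resolver, and GC is left to the caller."""

    def __init__(self, merge):
        self.cells = {}     # row key -> {version id: operation}
        self.merge = merge  # resolver folding all versions into one value

    def write(self, row, operation):
        # Blind write: never reads or overwrites existing versions.
        self.cells.setdefault(row, {})[uuid.uuid4().hex] = operation

    def read(self, row):
        # Pull every version and fold them together.
        return self.merge(list(self.cells.get(row, {}).values()))

# Example resolver: a grow-only set merged by union.
store = AppendLogStore(merge=lambda ops: set().union(*ops) if ops else set())
store.write("cart:1", {"apple"})
store.write("cart:1", {"banana"})
print(sorted(store.read("cart:1")))  # ['apple', 'banana']
```

Note there is no causality metadata anywhere in this sketch, which is exactly
why nothing can ever be pruned automatically.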

~~~
cbsmith
> As far as I can determine in testing with Jepsen, there are no cases where
> one can safely (e.g. in a way which guarantees some causal connection of
> your write to a future state of the system) update a cell in Cassandra
> without a strong timestamp coordinator: either an external system like
> Zookeeper, or Cassandra 2.0 paxos transactions.

It depends on what you mean by "guarantees". In most real-world systems, if
you can actually have two concurrent writes to the same record, it is
non-deterministic which write wins... which is precisely what happens in
Cassandra if you have multiple concurrent writes to a cell without a strong
timestamp coordinator.

There is a somewhat more complex case for scenarios where multiple cells tied
to the same record are updated together, but even then the result will look
identical to one where both writes happened and it was non-deterministic which
happened first.

While you can _create_ a timestamp authority that arbitrates precisely what
happens when, the reality is that if there wasn't an already observable
mechanism for determining this, who is to say which one happened first?

Really, the only problem you run into is when atomic validation logic needs to
be tied to whether the updates happen at all, which is what Cassandra 2.0's
lightweight transactions really address. Without them, you generally _do_ need
some kind of external authority or a "log concurrently, then resolve serially
later" strategy. That sounds like a pain, but the latter in particular is
again closer to how most of the real world operates (in particular, the world
of financial transactions).

~~~
aphyr
The world of consistency is rich: not all systems require serializability,
linearizability, or any one write winning. It might be interesting to skim
[http://pmg.csail.mit.edu/papers/adya-phd.pdf](http://pmg.csail.mit.edu/papers/adya-phd.pdf),
[http://ftp.research.microsoft.com/pub/tr/tr-95-51.pdf](http://ftp.research.microsoft.com/pub/tr/tr-95-51.pdf),
and pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-6956.pdf for a taste.

~~~
carterschonwald
(asking for myself and other readers of this thread)

is LWW = Last Write Wins?

~~~
nosequel
Yes.
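
For readers following along, here is a toy sketch of what last-write-wins
resolution does (hypothetical code, not Cassandra's implementation): each
version carries a timestamp, and the merge simply keeps the highest one.

```python
def lww_merge(versions):
    """Last-write-wins: keep the value carried by the highest timestamp.
    Ties here are broken by comparing the values themselves, which is as
    arbitrary as it sounds; one write silently loses."""
    return max(versions, key=lambda v: (v[0], v[1]))[1]

print(lww_merge([(200, "alice"), (100, "bob")]))  # 'alice': newest wins
print(lww_merge([(100, "alice"), (100, "bob")]))  # 'bob': arbitrary tie-break
```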

------
astral303
Cassandra counters were once implemented as a blazing-fast vector-clock patch
[1,2] that did not involve read-before-write on update. The biggest downside
(IMO) was that it did not support decrements. So the Cassandra tech leadership
decided [2] to implement counters as "read, increment, write." However, that
came with two major downsides: 1) counter updates are no longer replayable, so
if an update times out on the Cassandra side under heavy load, during a
lengthy compaction, or due to some network fluke, you just can't replay it.
You have to hope it made it over. Counters run on hope. But you can decrement
them now. 2) read-before-write can severely limit performance when there is a
large volume of counter writes (the workload becomes much more read-intensive,
unlike regular blazing-fast Cassandra writes).

1 -
[https://issues.apache.org/jira/browse/CASSANDRA-580](https://issues.apache.org/jira/browse/CASSANDRA-580)
2 -
[https://issues.apache.org/jira/browse/CASSANDRA-1072](https://issues.apache.org/jira/browse/CASSANDRA-1072)

However, in many, many use cases for counters, you just don't need to
decrement. And if you do need to decrement, you can often get away with two
counters that you subtract from each other. So it's a shame: restricting
counters to the monotonically increasing use case would have been just the
right trade-off for many users.
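
The two-counter trick is essentially the classic PN-counter idea: two
increment-only counters whose difference is the visible value. A minimal local
sketch (names are made up for illustration):

```python
class TwoCounter:
    """Decrement support built from two increment-only counters: the
    visible value is increments minus decrements, and each side stays a
    fast, replay-friendly monotonic counter."""

    def __init__(self):
        self.incs = 0  # increment-only
        self.decs = 0  # increment-only

    def increment(self, n=1):
        self.incs += n

    def decrement(self, n=1):
        self.decs += n  # recorded as an increment of the second counter

    @property
    def value(self):
        return self.incs - self.decs

c = TwoCounter()
c.increment(5)
c.decrement(2)
print(c.value)  # 3
```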

~~~
jbellis
I think you've misread the 580 patch. By definition, you need to read the
existing vector clock value before you can issue an update against it.

/Member of the presumably incompetent Cassandra tech leadership

~~~
astral303
Based on the algorithm described in the link here:
[http://www.slideshare.net/kakugawa/distributed-counters-in-cassandra-cassandra-summit-2010](http://www.slideshare.net/kakugawa/distributed-counters-in-cassandra-cassandra-summit-2010)
-- the insert path does not perform any reads, unless the column is already in
the memtable (so no disk read). Am I missing something?

~~~
jbellis
That is describing the 1072 approach aka the one in Cassandra today.
Incrementing without read-before-write was one of the reasons Twitter liked
it.

Unfortunately, while you can increment _on a single node_ without read-before-
write, you need to merge it with the existing value before you can replicate
to the others. Read-before-replicate, if you will.
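
A rough model of that read-before-replicate step, loosely based on the
per-node shard design discussed in the linked tickets (the shard layout here
is an assumption for illustration, not Cassandra's real format): each replica
can increment its own shard blindly, but replicating means reading the local
shards and merging them with the remote ones, keeping the highest-clock entry
per node.

```python
def merge_counter_shards(local, remote):
    """Keep, for each node id, the shard with the higher logical clock.
    Shards map node id -> (logical clock, count); illustrative only."""
    merged = dict(local)
    for node, (clock, count) in remote.items():
        if node not in merged or merged[node][0] < clock:
            merged[node] = (clock, count)
    return merged

def counter_value(shards):
    # The counter's value is the sum of every node's latest contribution.
    return sum(count for _, count in shards.values())

local = {"n1": (3, 7), "n2": (1, 2)}
remote = {"n2": (2, 5), "n3": (1, 1)}
merged = merge_counter_shards(local, remote)
print(counter_value(merged))  # 7 + 5 + 1 = 13
```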

There's a more recent, quite long discussion at
[https://issues.apache.org/jira/browse/CASSANDRA-4775](https://issues.apache.org/jira/browse/CASSANDRA-4775)
that boils down to, "this is actually the best we can do."

------
AaronFriel
That's great, but what if you want vector clocks in Cassandra? For example,
storing opaque blobs encrypted by a remote client. You might want to store
both versions. Does Cassandra offer the ability to store lists of primitive
data types and atomically insert/remove elements? This would allow clients to
atomically append and then (with less need for guaranteed ordering)
asynchronously attempt to remove the ancestor record. Even if the removal
fails, future gets would pull the list and again attempt to synchronize to a
single record.

I'd suppose that under very high contention you could have some problems, but
at least you'd be able to implement vector clocks as a library without
touching the Cassandra core.
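
Vector clocks themselves are small enough to carry as a client-side library,
whatever the store underneath. A minimal sketch of the two core operations
(hypothetical helper names; clocks are plain node-id -> counter maps):

```python
def vc_descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def vc_merge(a, b):
    """Pointwise maximum: the smallest clock dominating both inputs."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in set(a) | set(b)}

v1 = {"client_a": 2, "client_b": 1}
v2 = {"client_a": 1, "client_b": 2}
# Neither descends from the other: concurrent writes, keep both blobs.
print(vc_descends(v1, v2), vc_descends(v2, v1))  # False False
print(sorted(vc_merge(v1, v2).items()))  # [('client_a', 2), ('client_b', 2)]
```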

Apologies if my question is common knowledge. And for anyone who is well
versed in other key-value stores: do they support vector clocks or something
like the above (list elements with atomic append/prepend)?

~~~
jbellis
If you want to manage opaque blobs then you should probably use Riak or
Voldemort. Cassandra gives you a set of one-size-fits-most tools that by and
large work very well together, but if you want more of a build-your-own-
database toolkit, it is a bad fit.

------
ghc
I don't see what's so scary about vector clocks. I have no trouble with them
in Riak. Sure, there's a little extra logic to deal with certain types of
siblings, but honestly I haven't encountered any difficult edge cases.

But then again, maybe it's highly dependent on what you use Riak for. Anyone
out there have any type of application they've built with Riak that did
encounter issues with using vector clocks?

------
gigq
It's worth noting that HBase has the same method of using LWW on column-level
updates. While this is usually what you want, and like Cassandra it gives you
the ability to do fast blind writes, there are times when you need to make
sure you aren't issuing conflicting writes.

The solution that HBase employs is its checkAndPut functionality. Basically,
this lets you write a value and only successfully save the update if a given
column still holds the value you expect.

So, for example, you could have a "records" table with a column called
"check". Whenever you update a record, you pull the old one, do whatever
processing you want on it, set the "check" column to a new UUID, and then save
it with a checkAndPut specifying that the "check" column in HBase must still
hold the old check value you read. If any other process wrote to this row in
the meantime, it would have updated the check value with a new UUID, so the
checkAndPut fails, thereby detecting the conflict. You can then re-pull the
row and handle conflict resolution without blindly overwriting the changes.
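
The loop described above can be sketched against an in-memory stand-in for the
table (this is not the HBase client API, just the compare-and-set pattern):

```python
import uuid

class CheckAndPutTable:
    """In-memory sketch of HBase-style checkAndPut: a row write succeeds
    only if the 'check' column still holds the expected value."""

    def __init__(self):
        self.rows = {}

    def get(self, key):
        return self.rows.get(key)

    def check_and_put(self, key, expected_check, new_row):
        current = self.rows.get(key)
        current_check = current["check"] if current else None
        if current_check != expected_check:
            return False  # someone else wrote first; caller must re-pull
        self.rows[key] = new_row
        return True

table = CheckAndPutTable()
table.rows["r1"] = {"check": "v0", "count": 1}

# Read-modify-write guarded by the check column.
old = table.get("r1")
new = {"check": uuid.uuid4().hex, "count": old["count"] + 1}
assert table.check_and_put("r1", old["check"], new)      # succeeds
assert not table.check_and_put("r1", old["check"], new)  # stale check fails
```

On a conflict the caller re-reads the row, redoes its processing, and retries
with the freshly observed check value.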

~~~
jbellis
Cassandra exposes functionality similar to checkAndPut as Lightweight
Transactions:
[www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0](http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0)

~~~
gigq
Ah, good to know that functionality was added. I haven't used Cassandra since
the 0.6.x days so my knowledge of what's possible there is rusty.

------
banachtarski
This isn't anything dramatically new. Essentially, there is a tension between
data normalization and requirements for strong consistency.

If my data is more segmented, I can afford to be less stringent about it. I
think that forgoing vector clocks entirely in a distributed database, though,
is a recipe for disaster, no matter how denormalized my data is.

------
marshray
Riak/C* noob here. Couldn't I just store data as column cells in Riak?

    
    
        bucket: 'users'
        'jbellis/email', 'jbellis@example.com'
        'jbellis/phone', '555-5555'
    

What would I lose by doing this?

~~~
iamaleksey
Performance, mostly. To read both the email and the phone you'll have to make
two requests (or a single multiget). The keys have a good chance of belonging
to different replicas, too, and even if they don't, you are still going to
have 2x disk reads.

~~~
jbellis
You're also giving up atomicity on writes when you actually do want to update
multiple fields at once.

------
jtuple
Avinash Lakshman made a decision early in Cassandra's life cycle not to use
vector clocks, for performance reasons. You gain performance; you lose the
ability to safely update an existing key concurrently.

This is a reasonable tradeoff.

Cassandra has since evolved around this limitation: use write-once or
immutable data, transform updates into appending new columns rather than
modifications, handle all of this automagically through CQL datatypes, etc.

Really cool stuff. A nice way to evolve a product around a limitation and
still be immensely useful.

But, seriously, it's a tradeoff. Defending that choice against those who
question Cassandra is fine. But, jbellis has a history of claiming Cassandra
is a "one-size-fits-most tool", proclaiming that vector clocks aren't needed
in 99% of cases, etc. That's opinion presented as fact. Without omniscience no
one really knows what all users/the market actually needs, what 100% of all
use cases look like, etc. Let's cut the rhetoric and realize engineering is
about tradeoffs.

I still wonder if Cassandra won't someday add vector clocks, just like they
eventually added vnodes.

In Riak land, we have vector clocks. Do we pay a performance hit? Yes. Must
we? Maybe.

In theory, vector clocks should only be expensive when updating existing data.
If you are writing data once (the standard Cassandra approach), vector clocks
should be free. They're not currently, but that's an implementation detail
that Riak can fix, and we're considering fixing it in the near future.

What about when you have multiple updates to the same key? You must pay a
penalty there, right? Maybe. We're actively looking at approaches to reduce
that penalty in the future as well. Summary: keep multiple versions of an
object, just append a new version on write, read all versions on read, and
roll up siblings or LWW-resolve. Basically identical to the "just append a
column" approach Cassandra uses.

It's easy to extend Riak to support the same operations Cassandra does, with
the same performance characteristics, while still supporting conflict
detection for concurrent modifications. Best of both worlds. We'll likely do
it at some point too, as that's what engineers do: we make products better
over time.

As an aside, a benefit of per-key conflict detection and siblings is that it
doesn't require a sorted backend. The multi-version approach Cassandra uses
requires sorted data to be efficient on reads.

While most users of Riak today use the LevelDB backend, which is a sorted
backend similar to Cassandra's (both log-structured SST systems), Riak also
supports the non-sorted Bitcask backend for folks who want a high-performance,
purely K/V store. Bitcask is about 10x faster than LevelDB in many workloads
because it doesn't waste time sorting/rewriting data. Write amplification
really hurts for certain workloads.

Yet another tradeoff. Engineering is always about those pesky tradeoffs.

------
rdtsc
I don't know much about CAP but it doesn't sound right to me.

Let's say you store a data tuple, for example a position {X,Y}, as two
coordinates: X in one column and Y in another.

Now, when it gets updated concurrently and the database has to merge two
conflicting positions {X1,Y1} and {X2,Y2}, you could instead end up with a
non-existent/impossible/broken position {X1,Y2}.

Is it really that easily broken? Or am I not thinking about it straight?

~~~
brown9-2
In Cassandra the second update agreed upon by the cluster would "win", so
you'd get either (X1,Y1) or (X2,Y2). Since there is no clock mechanism, there
is no way for the cluster to see that the data has changed since the requester
decided it wanted to update the value.

~~~
aphyr
Rows are not isolated. You might see (X1, Y1), (X1, Y2), (X2, Y1), or (X2, Y2)
if timestamps happen to conflict.
[https://gist.github.com/aphyr/6402464](https://gist.github.com/aphyr/6402464)
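
The torn-row outcome is easy to reproduce in miniature: merge two row versions
cell by cell with LWW, and when timestamps tie, the tie-break can pick
different winners per cell (illustrative sketch, not Cassandra's code):

```python
def merge_row(a, b):
    """Per-cell LWW merge of two row versions; each cell is
    (timestamp, value). Ties are broken by comparing values, standing
    in for whatever arbitrary tie-break the store applies."""
    return {c: max(a.get(c, (0, None)), b.get(c, (0, None)))
            for c in set(a) | set(b)}

# Two clients write whole positions with the very same timestamp.
write1 = {"x": (100, "X1"), "y": (100, "Y1")}
write2 = {"x": (100, "X2"), "y": (100, "Y0")}
merged = merge_row(write1, write2)
# The tie-break picks write2's x but write1's y: a position neither
# client ever wrote.
print({c: merged[c][1] for c in sorted(merged)})  # {'x': 'X2', 'y': 'Y1'}
```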

~~~
brown9-2
Oh wow, that is bad. I was assuming the row-isolation guarantee held.

