
Cassandra is not row level consistent - leastangle
http://datanerds.io/post/cassandra-no-row-consistency/
======
jbellis
Cassandra developer here.

Lots of comments here about how Cassandra is AP so of course you get
inconsistent (non-serializable) results.

This is true, to a point. I'm firmly convinced that AP is a better way to
build distributed systems for fault tolerance, performance, and simplicity.
But it's incredibly useful to be able to "opt in" to CP for pieces of the
application as needed. That's what Cassandra's lightweight transactions (LWT)
are for, and that's what the authors of this piece used.

However! Fundamentally, mixing serializable (LWT) and non-serializable (plain
UPDATE) ops will produce unpredictable results and that's what bit them here.

Basically the same as if you marked half the accesses to a concurrently-
updated Java variable with "synchronized" and left it off of the other half as
an "optimization."

Don't take shortcuts and you won't get burned.

~~~
isoap
> Don't take shortcuts and you won't get burned.

I'm sorry that the user did something unexpected and then posted about it in a
way that made your application look bad. I know that must be frustrating.
However...

Calling the user out as doing something wrong when your application is failing
because of a use case you can't handle properly just looks bad. You serve your
users, not the other way around. Don't forget that.

If it were me and there were a case that my application couldn't handle
properly, if I couldn't fix it, I'd raise an error, and then document clearly
that they should not do this, such that when they search for that error,
they'd find the answer. Then, I'd work to see if there were a way I could
avoid the error altogether by not allowing that use case.

~~~
nathankunicki
There is no "one size" fits all database or distributed system in existence.
To say that there is is to say that MySQL is equally as well suited to all use
cases as Cassandra or Redis. They all store and serve data, right?

Cassandra makes no claims to be such a holy grail. Read their documentation,
and you can see the use cases it is good for and those it is not.

The author of this blog post chose one it is not good for.

Put another way, "I'm sorry that the Lamborghini you bought broke when you
attempted to go off-roading with it. Perhaps you should have bought a Jeep
instead?"

------
helper
As a long time Cassandra user its easy to forget that some of Cassandra's
semantics will be surprising to new users. That being said, if you are
considering adopting an AP database it really is important for you to know the
details about how write conflicts get resolved. This is perhaps the biggest
difference between Cassandra other databases like Riak and ought to be part of
your decision making process instead of a surprise you run into later.

That being said, using Cassandra for distributed locks is a terrible idea. I
can't think of any way in which Cassandra would be better than using
{Zookeeper,etcd,consul}. Trying to force a database to do something it really
isn't designed for will almost always lead to disappointment (and often
resentment) of said database.

~~~
siculars
Seconded. Don't use an AP system for distributed lock management. As with most
things, use the right tool for the right job.

When I speak to developers about Riak I tell them the biggest difference
between systems like Riak (including Cassandra) and traditional relational
systems is not the data model, ie. relational vs non-relations (or structured
vs unstructured) but rather the architecture, ie. distributed vs not
distributed. As in a "C" system vs a "non-C" system, where "C" is consistency.

Disclaimer, I work for Basho, makers of Riak.

~~~
crypto5
I think you can make Cassandra "C" by using quorum reads and writes. Also they
support some SERIAL consistency level, which uses PAXOS underneath, and
supposedly achieves consistency as well.

~~~
mdani
The IF statement uses Paxos as underlying implementation and still it may
result into an inconsistency if you are using two separate CQL statements.
Cassandra does not provide begin ... end construction where everything is
atomically committed or rolled back.

~~~
pkolaczk
You can use Cassandra LWT as a building block for multi-partition
transactions, in a similar way like CAS atomic operations are used to build
locks, mutexes and monitors and then monitors are used to implement ACID
transactions in RDBMS. However, noone says it would be easy or performant,
therefore this is probably not a good idea.

~~~
mdani
Right. Cassandra crawled at 20 TPS when we sent IF cql statements. One thing
to note is that Cassandra txns can never be rolled back - they can only be
committed/ retrried and so on.

------
Animats
This:

    
    
        INSERT INTO locks (id, lock, revision)
        VALUES ('Tom', true, 1)
        IF NOT EXISTS USING TTL 20;
    

looks like a race condition. The same problem comes up in SQL databases - you
can't lock a row that doesn't exist yet. If you write, in SQL:

    
    
        BEGIN TRANSACTION
        SELECT FROM locks WHERE id = "Tom" AND lock = true AND revision = 1;
        -- if no records returned
        INSERT INTO LOCKS locks (id, lock, revision) VALUES ('Tom', true, 1)
        COMMIT
    

you have a race condition. If two threads make that identical request near-
simultaneously, both get a no-find from the select, and both do the INSERT.
SELECT doesn't lock rows that don't exist.

The usual solution in SQL is to use UNIQUE indices which will cause an INSERT
to fail if the record about to be inserted already exists.

I ran into this reassembling SMS message fragments, where I wanted to detect
that all the parts had come in. The right answer was to do an INSERT for each
new fragment, then COMMIT, then do a SELECT to see if all the fragments of a
message were in. Doing the SELECT first produced a race condition.

~~~
dhd415
Some SQL databases (e.g., SQL Server and PostgreSQL) offer key-range locking
which, along with a serializable isolation level, will prevent this race
condition.

------
aartur
And there's another surprise waiting to be discovered. The execution of a LWT
is not guaranteed to return applied/not-applied response [1]. It can raise a
WriteTimeout exception that means "I don't know if applied". It looks like in
that case it can be worked around by inserting a UUID and in case of a
WriteTimeout reading the UUID using SERIAL consistency and checking if it's
the inserted UUID. But generally this limitation of LWTs makes implementing
some algorithms impossible, e.g. you can't implement a 100% reliable counter.

[1]
[https://issues.apache.org/jira/browse/CASSANDRA-9328](https://issues.apache.org/jira/browse/CASSANDRA-9328)

------
seanparsons
I railed against CQL right from the start and it's precisely because of this
kind of thing. Imitating SQL has the side effect of setting certain
expectations and drags a certain mental model along with it.

------
im_down_w_otp
If you're not writing purely immutable data or can't 100% guarantee a
serialized reader/writer, then you're just looking for trouble with Cassandra.

------
beefsack
The post was quite interesting, but I find image macros and GIFs really
distracting in technical writings.

------
zzzcpan
Shouldn't Cassandra be using Lamport timestamps or even vector clocks there?
Relying on timer and its resolution sounds strange for a database, especially
a distributed one.

~~~
parenthephobia
Cassandra's timestamps can be specified by the client: it's possible to use
any integer as your timestamp, such as Lamport timestamps, or even atomic
distributed counters (possibly using Cassandra's own counters).

Vector clocks are still out, though.

Edit: Actually, you can't do Lamport timestamps because you can't query the
current value of a timestamp.

Re-edit: That's wrong. I shouldn't believe any old blog I find. (I'm leaving
it in the comment because it was quoted in a reply.)

~~~
chillaxtian
> Edit: Actually, you can't do Lamport timestamps because you can't query the
> current value of a timestamp.

you can get the timestamp of each column back as part of a SELECT.

is that not enough?

------
leastangle
Author here.

Great discussion around the CAP theorem but it misses the point. AP vs CP /
Cassandra being AP is not relevant to this particular problem:

1) This is not a distributed systems corner case. You will run into this if
you are running Cassandra on a single node. A node should be able to guarantee
consistency internally during normal operation. If it is not able to do that,
there is something wrong with the system.

2) This is a case where queries are being send from the same process/thread
and go to exactly the same nodes. Attach a simple, monotonically increasing
query counter to each call and you can easily serialize it on the other side.

------
sigy
I see this as a basic misunderstanding of how LWT works. If you want to ensure
serializable operations, then you need to use LWT with preconditions that
ensure serializable operations.

Even better, stop trying to emulate the old and tired distributed lock methods
that have been proven over and over again to be insufficient.

------
refresh-creds
If it's not obtained synchronously what does the lock achieve?

------
supergirl
Couldn't explain the problem in a less cringy way? Sounds like a bug to me. So
file a bug report?

~~~
detaro
I think not strictly a bug, just surprising behavior. They linked to the
documentation that shows that it is "expected" to work like that, and to two
requests for features to help mitigate this issue.

~~~
supergirl
bug/feature whatever. CASSANDRA-6123 is exactly addressing row level
consistency. but here maybe they found another case.

------
agentgt
I guess I'm old or just not hip (most likely both) but I had to google WAT (I
know WTF but WAT ... never seen it).

Even now I'm still not sure but I presume WAT = what!

~~~
pushrax
It is a particular form of "what" that is used when dumbstruck.

~~~
webmaven
Closely related is WUT, particularly the popular variant LOLWUT:
[http://knowyourmeme.com/memes/lolwut](http://knowyourmeme.com/memes/lolwut)

------
second_picard
Hazelcast has a distributed lock (see
[http://docs.hazelcast.org/docs/3.7/manual/html-
single/index....](http://docs.hazelcast.org/docs/3.7/manual/html-
single/index.html#lock)) and I've used for more than a year to synchronize
jobs across a cluster.

------
steeve
Using Cassandra as a lock is a terrible, terrible idea.

------
known
With the Oracle/PostgreSQL, readers never wait for writers and writers never
wait for readers [http://philip.greenspun.com/sql/your-own-
rdbms.html](http://philip.greenspun.com/sql/your-own-rdbms.html) using
underlying locking mechanism
[http://www.beej.us/guide/bgipc/output/html/singlepage/bgipc....](http://www.beej.us/guide/bgipc/output/html/singlepage/bgipc.html#flocking)

------
known
Its' better to implement
[https://en.wikipedia.org/wiki/Priority_inversion](https://en.wikipedia.org/wiki/Priority_inversion)
in all
[https://en.wikipedia.org/wiki/NoSQL#Types_and_examples_of_No...](https://en.wikipedia.org/wiki/NoSQL#Types_and_examples_of_NoSQL_databases)

------
cbsmith
It feels like it is 2013 all over again: [https://aphyr.com/posts/294-jepsen-
cassandra](https://aphyr.com/posts/294-jepsen-cassandra)

------
mydpy
As others have noted, this blog uses an approach to data modeling that is
considered an anti pattern for an AP data store like Cassandra.

------
neeleshs
Anyone has experience around this on HBase, a CP database?

~~~
linuxhansl
Apache HBase committer here.

HBase is strictly CP (except for its geo-replication, and optional timeline-
consistent region replicas).

It uses MVCC for row "transactions" to always keep rows consistent. HBase also
has checkAndPut and checkAndDelete primitives, which are atomic, as well as
Increment and Append, which are atomic and serializable.

[http://hadoop-hbase.blogspot.com/2012/03/acid-in-hbase.html](http://hadoop-
hbase.blogspot.com/2012/03/acid-in-hbase.html) explains it fairly well.

Together with Apache Phoenix you have full multi-row transactions, but they
come with a price obviously.

