Cassandra is not row level consistent (datanerds.io)
274 points by leastangle on Nov 24, 2016 | 121 comments

Cassandra developer here.

Lots of comments here about how Cassandra is AP so of course you get inconsistent (non-serializable) results.

This is true, to a point. I'm firmly convinced that AP is a better way to build distributed systems for fault tolerance, performance, and simplicity. But it's incredibly useful to be able to "opt in" to CP for pieces of the application as needed. That's what Cassandra's lightweight transactions (LWT) are for, and that's what the authors of this piece used.

However! Fundamentally, mixing serializable (LWT) and non-serializable (plain UPDATE) ops will produce unpredictable results and that's what bit them here.

Basically the same as if you marked half the accesses to a concurrently-updated Java variable with "synchronized" and left it off of the other half as an "optimization."
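The failure mode in the article comes from Cassandra resolving write conflicts per cell, by timestamp. A minimal Python simulation of that last-write-wins merge (a sketch of the semantics only, not Cassandra code) shows how a row can end up as a blend of two writes:

```python
# Toy simulation of Cassandra's cell-level last-write-wins conflict
# resolution: each cell carries its own timestamp, and merging two row
# versions keeps the newest value per cell -- so a "row" can end up as a
# mix of two concurrent writes.

def merge_rows(a, b):
    """Merge two row versions; each cell is a (value, timestamp) pair."""
    merged = {}
    for col in set(a) | set(b):
        cells = [r[col] for r in (a, b) if col in r]
        merged[col] = max(cells, key=lambda cell: cell[1])  # last write wins per cell
    return merged

# Two writes to the same row, with timestamps interleaved per column:
insert = {"lock": (True, 100), "revision": (1, 100)}
update = {"lock": (False, 99), "revision": (2, 101)}  # e.g. slight clock skew

row = merge_rows(insert, update)
assert row["lock"] == (True, 100)    # this cell kept the first write
assert row["revision"] == (2, 101)   # this cell kept the second write
```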

Don't take shortcuts and you won't get burned.

I agree, Cassandra is doing here exactly what I (as a user) would expect.

When using a DB like C*, you always have to ask yourself, "does this update happen before or after this one - have I done anything to ensure that's the case?"

In this example, the second query (the UPDATE) is being partially applied before the first query (the INSERT) - and that's OK, because there's no ordering or dependency in the second query that forces it to run second. So the reordering of the queries that he's observing is legal, and can be simply avoided with "UPDATE ... IF revision = :last-rev".
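The suggested "UPDATE ... IF revision = :last-rev" works because it is a compare-and-set: the update only applies if it observed the outcome of the previous write. A toy Python sketch of that LWT-style precondition (plain in-memory code, not the Cassandra implementation):

```python
# Sketch of why a compare-and-set precondition ("UPDATE ... IF revision =
# :last-rev") serializes the two statements: the second write only applies
# if it observed the outcome of the first.

def conditional_update(row, expected_revision, changes):
    """Apply changes only if the row's revision matches (an LWT-style CAS)."""
    if row.get("revision") != expected_revision:
        return False  # not applied; caller must re-read and retry
    row.update(changes)
    return True

row = {"id": "Tom", "lock": True, "revision": 1}

# A release based on a stale revision is rejected instead of half-applied:
assert conditional_update(row, expected_revision=0, changes={"lock": False}) is False
assert row["lock"] is True

# A release with the correct precondition goes through atomically:
assert conditional_update(row, expected_revision=1, changes={"lock": False, "revision": 2})
assert row == {"id": "Tom", "lock": False, "revision": 2}
```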

My biggest complaint about C*: the SQL-like interface. This seems like a terrible decision to me.

If you are used to relational databases and SQL, it's a real struggle to get your head in the right place. Some of the gotchas:

* Cell level vs row level consistency/locking (as stated in this article)
* Grouping / counting isn't really a thing.
* WHERE statements are actually used for identifying the key/value pair you want and not for filtering the rows.
* Indexes aren't really indexes. They are definitions of partitions and sort order.

That's actually my favourite part of C*. I think the CQL interface makes programming against it extremely easy, and dare I say enjoyable.

It makes getting started, and understanding what's going on, a breeze.

> the second query (the UPDATE) is being partially applied before the first query (the INSERT) - and that's OK

"Partially applied" is ok with a database?

The description of Cassandra on its site is "Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data."

If it's mission-critical data, I wouldn't do arbitrary things with it for conflict resolution that can corrupt data.

Cassandra is perfectly fine. If you want your writes to be consistent, learn to use LWTs properly. In the same way, if you want your data to be fully consistent in an RDBMS, learn to use the SERIALIZABLE isolation level (which is not the default in most RDBMSes, for performance reasons). If, in an RDBMS, you use SERIALIZABLE for half of the updates and READ UNCOMMITTED for the other half, guess what consistency guarantees you really get?

This is not arbitrary. It may not be intuitive coming from an RDBMS background, but it isn't arbitrary.

As Johnathon pointed out, it's like properly using synchronized or volatile only half the time. I don't call its sometimes not working arbitrary. I call it expected for not following the rules of the system.

If I follow your argument correctly, it is basically the same argument as "your C compiler is correct, what you've written is invalid and the standard allows undefined behaviour here".

Which may be a technically valid argument against the compiler/database system, but it's not a valid argument for defending the system as a whole: if a standard allows arbitrary execution instead of bailing out on non-standard (ambiguous) input, it is unreliable.

Is variable assignment in C/Java/... unreliable? It behaves very similarly to what C* does. Concurrent access and modification will produce undefined behaviour if you don't explicitly protect it.


Getting access to things like concurrent locks is HARD to get right. That is why there are so many simple languages that don't let you touch concurrency.

Doesn't mean there is no need for it in the world, and no one should be able to use it.

RDBMSes can cause similar inconsistencies if you don't know what you're doing. It is like setting read uncommitted and then complaining about dirty reads.

Let's pretend I'm leading a blind child by telling them which direction they should go, and if they don't step carefully, they could be hurt.

If I can't see the child, should I continue to give them direction, or tell them to stop?

In this use case, the database makes changes to data without knowing what is correct and what is harmful. That is not the user's fault. It's a code choice.

No, it's a user's choice to use a DB that has this tradeoff.

As was said a couple of times already, all of your complaints about C* would work just as well for locking mechanisms in most popular languages.

This would be a more sympathetic response if Cassandra did less to tempt devs not intimately familiar with it into doing things which will hose them later on. I've worked with it very successfully in the past and contributed to one of its client libraries, but I hesitate to recommend it because that trait makes it unnecessarily dangerous to use.

So, since their use of UPDATE is problematic, what is the correct way to release the locks?


DELETE ... IF <condition>

UPDATE ... IF <condition>

Disclaimer: I am working with the author

The problem is that even when using LWTs for both updates you run into the same problem. The only thing you gain is that one of the statements (either lock or release) wins, and in this case it then only works because the lock uses a TTL and is removed after some time.

Also, two conflicting statements from the same thread going to the same node should be easily serializable for Cassandra - or at least be logged.

> Don't take shortcuts and you won't get burned

Since you are the Datastax CTO, maybe some alignment with the marketing team on how features are communicated to users might help users not get burned? :D

This is too subtle. Incompatible operations should be rejected. Reliance on the programmer to correctly use systems that don't enforce consistency to write consistent transactions is a bad strategy.

This kind of nannying can be really expensive to do correctly when trying to build high performance systems -- to the point where it doesn't make sense to punish everyone just because someone who didn't read the manual MIGHT misuse the product. It's not at all unreasonable to mix transactions with different guarantees if they never touch the same data, and tracking that accurately enough without pissing off your customers with performance needs seems like a fool's errand.

Read the CAP theorem, or stick to your single node.

The situation could be considered similar to .NET adding the IllegalCrossThreadException exception to Windows Forms in v2¹, defaulting to detecting invalid behavior with the option to disable checking (Control.CheckForIllegalCrossThreadCalls).

.NET v4.5 introduced a new approach (Task-based Asynchronous Pattern)² with changes to framework and language (C#5) to "simplify" implementing things correctly.

.NET took the developer-friendly path; Cassandra, not yet.

¹ https://web.archive.org/web/20060414185346/http://msdn.micro...

² http://stackoverflow.com/a/18033198

haha, I was just going to ask you what you thought of this, and I'm glad to see you responded already! Yes! :)

> Don't take shortcuts and you won't get burned.

I'm sorry that the user did something unexpected and then posted about it in a way that made your application look bad. I know that must be frustrating. However...

Calling the user out as doing something wrong when your application is failing because of a use case you can't handle properly just looks bad. You serve your users, not the other way around. Don't forget that.

If it were me and there were a case that my application couldn't handle properly, if I couldn't fix it, I'd raise an error, and then document clearly that they should not do this, such that when they search for that error, they'd find the answer. Then, I'd work to see if there were a way I could avoid the error altogether by not allowing that use case.

There is no "one size" fits all database or distributed system in existence. To say that there is is to say that MySQL is equally as well suited to all use cases as Cassandra or Redis. They all store and serve data, right?

Cassandra makes no claims to be such a holy grail. Read their documentation, and you can see the use cases it is good for and those it is not.

The author of this blog post chose one it is not good for.

Put another way, "I'm sorry that the Lamborghini you bought broke when you attempted to go off-roading with it. Perhaps you should have bought a Jeep instead?"

It's not possible to detect whether the user wants CP. You can assume they want CP, but the entire point of Cassandra is that it doesn't make that assumption.

Cassandra is AP with opt-in CP. This is an explicit tradeoff. You're giving up the assumption (which enables error checking) that everything is CP in order to get AP performance. This tradeoff is one of the main use cases for which Cassandra exists.

The vast majority of the time, error checking is way more valuable than AP performance, so your approach to handling the error makes sense, but if that's your situation you shouldn't be using Cassandra. There are a wide variety of ACID-compliant relational databases that do what you want.

TL;DR: Using Cassandra and expecting CP error checking is like using a hammer and expecting screwdriver behavior.

The high-handedness doesn't improve your argument. Consider forgoing it next time.

As a long time Cassandra user it's easy to forget that some of Cassandra's semantics will be surprising to new users. That being said, if you are considering adopting an AP database it really is important for you to know the details of how write conflicts get resolved. This is perhaps the biggest difference between Cassandra and other databases like Riak, and it ought to be part of your decision making process instead of a surprise you run into later.

That being said, using Cassandra for distributed locks is a terrible idea. I can't think of any way in which Cassandra would be better than using {Zookeeper,etcd,consul}. Trying to force a database to do something it really isn't designed for will almost always lead to disappointment (and often resentment) of said database.

Seconded. Don't use an AP system for distributed lock management. As with most things, use the right tool for the right job.

When I speak to developers about Riak, I tell them the biggest difference between systems like Riak (including Cassandra) and traditional relational systems is not the data model, i.e. relational vs non-relational (or structured vs unstructured), but rather the architecture, i.e. distributed vs not distributed. As in a "C" system vs a "non-C" system, where "C" is consistency.

Disclaimer, I work for Basho, makers of Riak.

Minor nitpick: distributed systems can be consistent too. Your comment implies all distributed systems choose availability over consistency.

While they don't all pick one over another, there are limitations on how available and consistent you can make systems, for the same level of partition tolerance - https://en.wikipedia.org/wiki/CAP_theorem

While technically you're right, the CAP theorem uses a very strict definition of availability, which is often not needed to consider the system available in real-life. Also the consistency in CAP theorem is defined as linearizability, and this is just one of very many kinds of consistency. Often the systems are ok with weaker types of consistency.

Therefore, I'd really like that everyone stopped labeling distributed databases as AP or CP, because this is an oversimplification and most of the time, just plain wrong.

I think you can make Cassandra "C" by using quorum reads and writes. They also support a SERIAL consistency level, which uses Paxos underneath and supposedly achieves consistency as well.

The IF statement uses Paxos as its underlying implementation, and you may still end up with an inconsistency if you are using two separate CQL statements. Cassandra does not provide a begin ... end construct where everything is atomically committed or rolled back.

You can use Cassandra LWT as a building block for multi-partition transactions, in a similar way to how CAS atomic operations are used to build locks, mutexes and monitors, and monitors are then used to implement ACID transactions in an RDBMS. However, no one says it would be easy or performant, so this is probably not a good idea.

Right. Cassandra crawled at 20 TPS when we sent IF CQL statements. One thing to note is that Cassandra txns can never be rolled back - they can only be committed/retried and so on.

Quorum won't solve consistency problems like this unless you can also guarantee a serialized reader/writer for any given piece of data. The best you can do by pinning queries to primary replicas and enforcing R + W > N is RYOW consistency, and that depends on the aforementioned serialization point.

Quorum reads + writes and applying only non-destructive updates are enough to get linearizable consistency in Cassandra, even if clocks are not perfectly synchronized.

If you're applying non-destructive updates then you don't even need quorum to achieve consistency. You're just writing immutable data.

If you're applying non-destructive updates without quorum, you would not get monotonic read consistency, because stale reads would be possible. You could write a row, immediately read it back from another replica and find out the row is not there, because that replica didn't get it yet. You'd have only eventual consistency.
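The quorum claim above is just counting: with R + W > N, every read quorum must share at least one replica with every write quorum, so the read set always contains some replica that accepted the latest write. A quick exhaustive check in Python for N=3, R=W=2:

```python
from itertools import combinations

# Why R + W > N gives overlap: for N replicas, every read quorum of size R
# shares at least one replica with every write quorum of size W, so a read
# always contacts some replica that accepted the latest write.
N, R, W = 3, 2, 2  # QUORUM reads and writes at replication factor 3
replicas = range(N)

for write_quorum in combinations(replicas, W):
    for read_quorum in combinations(replicas, R):
        assert set(write_quorum) & set(read_quorum), "quorums failed to overlap"

print("every read quorum overlaps every write quorum")
```

With R = W = 1 (consistency level ONE) the condition fails, which is exactly the stale-read scenario described above: the read may land on a replica the write never reached.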

Datastax seems to think that it is a good idea: http://www.datastax.com/dev/blog/consensus-on-cassandra

That piece was written in 2014, which was a different time ;)

Plus, if we were to take vendors' words at face value, we'd all be using Docker and MongoDB in production.

I work for a large company that uses Docker in production. A lot of companies do. There are a lot of big names using Kubernetes, Mesos or some other container orchestration system.

Just like many companies use mongodb and nanoservices.

:) My last client (very, very big client) for the past year ran nearly all their production systems with Docker and MongoDb...

Never had any problems with Docker. MongoDB on the other hand... Suffice to say it is really fragile when spread across multiple datacentres, which is probably not a surprise to HN.

Details matter. All updates in the Datastax post are protected by LWT. That code is correct. The OP's code was wrong, because he was mixing non-LWT updates.

Cassandra never claimed to be a consistent distributed database. It's really quite sad that someone had to find that out the hard way.

It is optionally consistent if you know how to use it and you know its limitations. Their use of Cassandra was obviously wrong. Mixing LWT and non-LWT is like having only half of your shared data protected by mutexes - this isn't going to work correctly.

In such a situation though, shouldn't it raise an error? At least that way, the user could have a chance to recover.

Why should it make an arbitrary decision that unknowingly corrupts data for some users?

The only way for C* to know that it should raise an error would be to implicitly protect all writes with LWT, and that is not what most users want.

Following the parent's example, if you don't protect memory access with a lock, you can't know that somebody else locked it.

"Must be consistent" is a property of the data, not of a particular transaction. Data on which inconsistent transactions are ever allowed should be declared, preferably with big hazard signs.

The data aren't either consistent or inconsistent. The data are either correct or incorrect, and the definition of "correct" is determined by business requirements. Most of the time the weak eventual consistency model provided by Cassandra is sufficient to keep data correct if you are using it right. E.g. you don't need ACID and serializable consistency to do financial stuff correctly - banks and accountants figured out how to do this long before computers were invented. That's why Cassandra has serializable consistency as an opt-in only (LWTs), but it comes at a price in latency and availability. By using strong consistency all the time, you'd lose most of the benefits of Cassandra, and you could probably just replace it with a single RDBMS node (and suffer scalability and availability problems then).

Right, and does Java throw an exception if you forget a lock around a shared variable?

> I can't think of any way in which Cassandra would be better than using {Zookeeper,etcd,consul}

This is the correct answer. Zookeeper & Curator make it nearly foolproof to implement a distributed lock correctly, whereas with Cassandra, it's the other way round.

> Zookeeper & Curator make it nearly foolproof to implement a distributed lock correctly, whereas with Cassandra, its the other way round.

Cassandra makes it nearly distributed to implement a foolproof lock correctly?

Yes, nearly! But not quite.

What about using Redis for distributed locks? (Assuming you are not using Redis Cluster.)

You'll want to use a library that implements Redlock: http://redis.io/topics/distlock

You really don't want to use redlock: http://martin.kleppmann.com/2016/02/08/how-to-do-distributed...

Antirez's rebuttal didn't really refute Martin's overall thesis that Redlock is a very bad locking algorithm: http://antirez.com/news/101

Did anyone else ever publish a meaningful response? I thought Antirez's response was reasonable but I also recognize the inherent bias. I'd love to see responses from other experts, ideally written such that non-experts can reasonably understand.

I recall several people on twitter (I follow a lot of distributed systems people / cs professors as it is part of my job to build these things) who agreed with Martin's analysis.

I'll have to see if I can dig some of the responses up. It bugs me that I still don't know if this algorithm is actually safe.

Honestly given Antirez's response I'm genuinely surprised no one has written a TLA+ or similar formal verification proof. Then the outcome is binary; either redlock is sound and works as expected, or it is not.

Amazon's AWS architect, James Hamilton, is a big fan of this approach, as are most of the heavyweights in distributed systems:


EDIT: prefaced URL with http

Thanks, I learned the fencing token lock algorithm from this.
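For readers who haven't seen it, the fencing-token idea from Kleppmann's post can be sketched in a few lines of Python (a hypothetical lock service and storage layer, purely illustrative):

```python
# Sketch of the fencing-token technique: the lock service hands out a
# monotonically increasing token with each grant, and the storage layer
# rejects writes carrying a token older than one it has already seen.

class LockService:
    def __init__(self):
        self._token = 0
    def acquire(self):
        self._token += 1
        return self._token  # fencing token, strictly increasing

class Storage:
    def __init__(self):
        self._highest = 0
        self.data = None
    def write(self, token, value):
        if token < self._highest:
            return False  # stale lock holder (e.g. paused by GC); fenced off
        self._highest = token
        self.data = value
        return True

locks, store = LockService(), Storage()
t1 = locks.acquire()          # client 1 gets token 1, then stalls
t2 = locks.acquire()          # lock expires; client 2 gets token 2
assert store.write(t2, "from client 2") is True
assert store.write(t1, "from client 1") is False  # stale token rejected
assert store.data == "from client 2"
```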

I personally wouldn't use redis for anything besides a cache.

Why not? It has a WAL persisted to disk at a desired interval. It provides the same durability levels as most databases.


Let me be clear, I think redis is a well engineered piece of software and I have been constantly impressed with the way antirez runs the project.

Specifically for the question of a locking service, I need more availability than what a single node can offer. I'm not sure if you are advocating for using redis cluster as a locking service, but since it intentionally doesn't offer write safety during a partition that does not seem advisable.


    INSERT INTO locks (id, lock, revision)
    VALUES ('Tom', true, 1)
looks like a race condition. The same problem comes up in SQL databases - you can't lock a row that doesn't exist yet. If you write, in SQL:

    SELECT * FROM locks WHERE id = 'Tom' AND lock = true AND revision = 1;
    -- if no records returned
    INSERT INTO locks (id, lock, revision) VALUES ('Tom', true, 1);
you have a race condition. If two threads make that identical request near-simultaneously, both get a no-find from the select, and both do the INSERT. SELECT doesn't lock rows that don't exist.

The usual solution in SQL is to use UNIQUE indices which will cause an INSERT to fail if the record about to be inserted already exists.

I ran into this reassembling SMS message fragments, where I wanted to detect that all the parts had come in. The right answer was to do an INSERT for each new fragment, then COMMIT, then do a SELECT to see if all the fragments of a message were in. Doing the SELECT first produced a race condition.
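The UNIQUE-index fix described above can be demonstrated with SQLite (chosen only because it ships with Python; the same behavior holds in any SQL database with unique constraints):

```python
import sqlite3

# A UNIQUE/PRIMARY KEY constraint turns "insert if absent" into an atomic
# operation: whichever INSERT lands second fails instead of silently
# duplicating the lock row.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE locks (id TEXT PRIMARY KEY, lock INTEGER, revision INTEGER)")

db.execute("INSERT INTO locks (id, lock, revision) VALUES ('Tom', 1, 1)")  # first thread wins

try:
    # the "other thread" attempts the same insert after a stale SELECT
    db.execute("INSERT INTO locks (id, lock, revision) VALUES ('Tom', 1, 1)")
    raise AssertionError("second insert should have failed")
except sqlite3.IntegrityError:
    pass  # the race loser gets an error instead of corrupting state
```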

Some SQL databases (e.g., SQL Server and PostgreSQL) offer key-range locking which, along with a serializable isolation level, will prevent this race condition.

ON DUPLICATE KEY is another construct that is used in some DB (MySQL or HBase with Apache Phoenix and others).

This is not a race in Cassandra.

IF NOT EXISTS causes replicas to agree on the result using Paxos, and only responds with success if a certain number of replicas concur and write the transaction to disk.

I believe it is a quorum for the SERIAL consistency level, and a local quorum (a quorum of replicas in the local DC) for LOCAL_SERIAL.

And there's another surprise waiting to be discovered. The execution of a LWT is not guaranteed to return applied/not-applied response [1]. It can raise a WriteTimeout exception that means "I don't know if applied". It looks like in that case it can be worked around by inserting a UUID and in case of a WriteTimeout reading the UUID using SERIAL consistency and checking if it's the inserted UUID. But generally this limitation of LWTs makes implementing some algorithms impossible, e.g. you can't implement a 100% reliable counter.

[1] https://issues.apache.org/jira/browse/CASSANDRA-9328

I railed against CQL right from the start and it's precisely because of this kind of thing. Imitating SQL has the side effect of setting certain expectations and drags a certain mental model along with it.

If you're not writing purely immutable data or can't 100% guarantee a serialized reader/writer, then you're just looking for trouble with Cassandra.

The post was quite interesting, but I find image macros and GIFs really distracting in technical writings.

Shouldn't Cassandra be using Lamport timestamps or even vector clocks there? Relying on timer and its resolution sounds strange for a database, especially a distributed one.

Cassandra's timestamps can be specified by the client: it's possible to use any integer as your timestamp, such as Lamport timestamps, or even atomic distributed counters (possibly using Cassandra's own counters).

Vector clocks are still out, though.

Edit: Actually, you can't do Lamport timestamps because you can't query the current value of a timestamp.

Re-edit: That's wrong. I shouldn't believe any old blog I find. (I'm leaving it in the comment because it was quoted in a reply.)

> Edit: Actually, you can't do Lamport timestamps because you can't query the current value of a timestamp.

You can get the timestamp of each column back as part of a SELECT.

Is that not enough?

"Conversely, if there are concurrent changes to a single field, only one will be retained, which is also what we want."

They are wrong, though. As the HN submission illustrates, people want some order and eventual consistency, not a rule that selects a single field during concurrent changes. And this is where a Lamport timestamp could help.
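For reference, a Lamport clock is only a few lines; the resulting counters respect causal order, and since Cassandra accepts any integer as a client-supplied write timestamp they could in principle be used that way (a sketch, not an endorsement of doing this in production):

```python
# Minimal Lamport clock: each process keeps a counter, bumps it on every
# local event, and on observing another timestamp fast-forwards past it.
# Causally later events always get strictly larger timestamps.

class LamportClock:
    def __init__(self):
        self.time = 0
    def tick(self):                  # local event (e.g. issuing a write)
        self.time += 1
        return self.time
    def observe(self, other_time):   # receiving a message / reading a cell
        self.time = max(self.time, other_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.tick()          # a writes at 1
t2 = b.observe(t1)     # b sees a's write, jumps to 2
t3 = b.tick()          # b's next write is 3
assert t1 < t2 < t3    # causally later writes get larger timestamps
```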

No matter what scheme you use, it is still a rule. Lamport timestamps, whatever are just different ways of resolving a conflict. In general, the fact that you have designed your system to have this conflict in the first place is an easier problem to solve than having users understand more complex resolution methods -- especially those that push a callback into the app design rather than actually just solving the basic problem in a pragmatic way.

Author here.

Great discussion around the CAP theorem but it misses the point. AP vs CP / Cassandra being AP is not relevant to this particular problem:

1) This is not a distributed systems corner case. You will run into this if you are running Cassandra on a single node. A node should be able to guarantee consistency internally during normal operation. If it is not able to do that, there is something wrong with the system.

2) This is a case where queries are being sent from the same process/thread and go to exactly the same nodes. Attach a simple, monotonically increasing query counter to each call and you can easily serialize them on the other side.

I see this as a basic misunderstanding of how LWT works. If you want to ensure serializable operations, then you need to use LWT with preconditions that ensure serializable operations.

Even better, stop trying to emulate the old and tired distributed lock methods that have been proven over and over again to be insufficient.

If it's not obtained synchronously what does the lock achieve?

Couldn't you explain the problem in a less cringey way? It sounds like a bug to me. So file a bug report?

I think not strictly a bug, just surprising behavior. They linked to the documentation that shows that it is "expected" to work like that, and to two requests for features to help mitigate this issue.

Bug/feature, whatever. CASSANDRA-6123 addresses exactly row level consistency, but maybe here they found another case.

I guess I'm old or just not hip (most likely both) but I had to google WAT (I know WTF but WAT ... never seen it).

Even now I'm still not sure but I presume WAT = what!

In case this is not the explanation you found: http://knowyourmeme.com/memes/wat

I know it from this presentation: https://www.youtube.com/watch?v=AU2Rhq5eWa4

It is a particular form of "what" that is used when dumbstruck.

Closely related is WUT, particularly the popular variant LOLWUT: http://knowyourmeme.com/memes/lolwut

with a strong subtext of what the fuck / you must be fucking kidding me

Hazelcast has a distributed lock (see http://docs.hazelcast.org/docs/3.7/manual/html-single/index....) and I've used it for more than a year to synchronize jobs across a cluster.

Using Cassandra as a lock is a terrible, terrible idea.

With Oracle/PostgreSQL, readers never wait for writers and writers never wait for readers (http://philip.greenspun.com/sql/your-own-rdbms.html), using the underlying locking mechanism (http://www.beej.us/guide/bgipc/output/html/singlepage/bgipc....).

It feels like it is 2013 all over again: https://aphyr.com/posts/294-jepsen-cassandra

As others have noted, this blog uses an approach to data modeling that is considered an anti-pattern for an AP data store like Cassandra.

Does anyone have experience with this on HBase, a CP database?

Apache HBase committer here.

HBase is strictly CP (except for its geo-replication, and optional timeline-consistent region replicas).

It uses MVCC for row "transactions" to always keep rows consistent. HBase also has checkAndPut and checkAndDelete primitives, which are atomic, as well as Increment and Append, which are atomic and serializable.

http://hadoop-hbase.blogspot.com/2012/03/acid-in-hbase.html explains it fairly well.

Together with Apache Phoenix you have full multi-row transactions, but they come with a price obviously.


100,000 nodes on Apple and 2,500 nodes on Netflix would like to have a word with you.

Apple no longer uses Cassandra.

I'm not sure how you got that impression but they still have huge deployments of Cassandra.

Maybe they are thinking of Facebook.

Some details besides ad-hominem might actually be helpful. When you make statements like this it just looks like trolling.

Cassandra is not a person, so that was not ad hominem.

You are technically right (the best kind), but the underlying idea is valid, that this is just cussing out a product with no technical detail, and that's not constructive.

Technically, Cassandra was a person in Greek mythology. So still ad hominem.

You can't comment like this here. Please post civilly and substantively or not at all.


I second other people's responses that you should clean this up and add a lot of substance.

Often the best comments start off exactly like yours: it's clear you feel very strongly about the issue, so edit it and say how/why! List the points that irked you.

Maybe your startup just sucked at Cassandra?

To be fair, it's easy to suck at Cassandra and most folks that use it used to suck at it.

To be fair, some startups are being trendy and using Cassandra when a traditional SQL database would better fit their needs.

One I worked at used a framework designed for SQL, wrote a Cassandra storage backend, didn't denormalize data at all (!) and then (badly) performed JOINs in software because the framework was built around the storage having relational operations (!!) topped off by their backend implementation not being aware of partitioning keys, and hence using unique values for them (!!!).

The first two I discovered when trying to diagnose why certain parts of our website/API were extremely slow. The third I discovered mid-import when suddenly the (staging) cluster stopped being able to process queries.

There wasn't any problem Cassandra was solving, our read volume was much higher than our write volume (like, 10x-100x), and we didn't even have that much data in the first place. Premature optimization (for a problem we were a long way from having) and being trendy.

I don't think Cassandra is particularly bad (and can be used in awesome ways for high-write, low-read immutable timeseries data) but you have to use Cassandra as Cassandra, not as a trendy MySQL.

Bet that was an optimal solution for the devs CVs and attracting VC money.

Agreed. If Writes >> Reads and TPS > 1M, maybe Cassandra :-)

agreed, cassandra is sharp edges all around.

If you or anyone else enjoys it - ping me via email, I might have a cool job for you.

You should probably put your email in your profile then... ;)

My bad. Added.
