Cassandra: Daughter of Dynamo and BigTable (insightdataengineering.com)
81 points by aouyang1 on May 18, 2015 | 47 comments



What's interesting to note is that Dynamo/Cassandra usage was killed at both Amazon and Facebook (DynamoDB from AWS is actually not based on Dynamo tech except in terms of what not to do).

[1] - https://www.facebook.com/notes/facebook-engineering/the-unde...

edit: Added link to facebook post on hbase vs cassandra


Damien Katz (one of my cofounders at Couchbase, and therefore someone with a dog in this fight) wrote this take-down of the Dynamo model, for folks wondering why:

http://damienkatz.net/2013/05/dynamo_sure_works_hard.html

TLDR quote:

The Dynamo system is a design that treats the probability of a network switch failure as having the same probability of machine failure, and pays the cost with every single read. This is madness. Expensive madness.


I don't agree with his fundamental premise:

> Network Partitions are Rare, Server Failures are Not

Network partitions happen all the time. Sure, the whole "a switch failed and that piece of the network isn't there anymore" doesn't happen a lot, but what does happen a lot is a slow or delayed connection, or a machine going offline for a few seconds.


Indeed. We have some external servers in a partner's datacenter which run under VMware's vMotion. Every now and then vMotion will shuffle a VM to another physical server, causing the entire OS to freeze for several (often 20-30) seconds, and everything that is partition-sensitive, like RabbitMQ and Elasticsearch, throws a tantrum and keels over.

Even VMs on more statically allocated clouds like DigitalOcean and AWS will experience small, constant blips that affect your whole stack.

What annoys me in particular is that these blips affect everything. Every app needs to fail gracefully, be it a PostgreSQL client connection, a Memcached lookup or an S3 API call. The fact that such catch-and-retry boilerplate logic needs to be built into the application layer, and every layer within it, is still something I find rather insane. It leaks into the application logic in often rather insidious ways, or in ways that pollute your code with defenses. Everything has to be idempotent, which is easy enough for transactional database stuff, less easy for things like asynchronous queues that fire off emails. Erlang has already provided a solution to the problem, but I suspect we need OS-level support to avoid reinventing the wheel in every language and platform. /rant
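
A rough sketch in Python of the kind of catch-and-retry boilerplate I mean (the helper and the commented-out S3 call are just illustrative, not any particular library's API):

    import random
    import time

    def with_retries(op, attempts=5, base_delay=0.1,
                     retryable=(ConnectionError, TimeoutError)):
        """Retry a flaky operation with exponential backoff and jitter."""
        for attempt in range(attempts):
            try:
                return op()
            except retryable:
                if attempt == attempts - 1:
                    raise                 # out of retries: surface the failure
                # back off exponentially, with jitter to avoid thundering herds
                time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))

    # The wrapped operation must be idempotent, exactly as noted above, e.g.:
    # photo = with_retries(lambda: s3.get_object(Bucket="photos", Key="me.jpg"))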


Our customers tend to be the kind who need extreme performance, so they aren't spanning clusters across WANs. For well-tuned datacenters, rack awareness (putting the replicas in sane places) is more useful.

For WAN replication we have a cross-datacenter replication which works on an AP model.


To be clear, I was in no way making a judgement about CouchDB vs. Cassandra. I've only given Couch a cursory glance, so I wouldn't be qualified to make such a judgement.

I was simply trying to point out that while you may have a very good argument as to why Couch is better, the network partition argument is not sound, and you may want to look for a better argument to make.

I'm personally against single masters because they are SPOFs. With a master, at some point there needs to be a single arbiter of truth, and if that is unavailable, then the system is unavailable.


A nitpick, but an important one which I wish the Couchbase folks wouldn't let slip as often as they do: CouchDB has very different properties from Couchbase, and the two should be considered entirely different database designs, regardless of the availability of a sync gateway for replication with a number of JSON stores.


Different database designs, but similar document model. In fact, Couchbase Sync Gateway is capable of syncing between Couchbase Server and Apache CouchDB. Also our iOS, Android, and .NET libraries can sync with CouchDB and PouchDB. Everything open source, of course. More info: http://developer.couchbase.com/mobile/


It's not that it can sync, nor the data model.

It's that these are fundamentally very different databases with different trade-offs. You can't just take one and adjust some API calls and expect things to work in a similar way. It only confuses people when this is quietly ignored, and others assume that since it wasn't pointed out as wrong, it must be the same thing.

I've had far too many conversations with people who use Couchbase and can't tell the difference, enough that I would say it's just general confusion. It's lax work on Couchbase's part, and a thorn in the side of the Apache CouchDB project, that there is no effort to help clarify the fact that they are indeed independent and now very different databases.


Exactly. Couchbase trades some availability during rebalance for more lightweight client interactions. As Damien argues in his post, this allows it to meet the same SLA with less hardware.


Cassandra has rack awareness...


Really it doesn't matter, and your point only goes to show one of Cassandra's flaws. When either a network partition or a server failure happens, Cassandra starts reshuffling data amongst multiple hosts and filling network pipes. Contrast this with a setup where you have a static partitioning of hosts to partitions and a leader per partition. Then you only need to (possibly) elect a new leader and carry on.

This is especially relevant when you need to do these things because of unexpected load increases or the loss of hosts in your cluster.
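
As a toy illustration of that static-partitioning scheme (hypothetical host names; not how any particular database implements it), a failure only changes the leader and never moves data:

    # Toy model: partitions are statically mapped to replica sets, and a host
    # failure only promotes a new leader; no data is reshuffled anywhere.
    PARTITIONS = {
        0: ["host-a", "host-b", "host-c"],
        1: ["host-b", "host-c", "host-d"],
        2: ["host-c", "host-d", "host-a"],
    }

    def leader(partition, alive_hosts):
        """The leader is simply the first live replica in the static list."""
        for host in PARTITIONS[partition]:
            if host in alive_hosts:
                return host
        raise RuntimeError("all replicas for partition %d are down" % partition)

    alive = {"host-b", "host-c", "host-d"}   # host-a has failed
    print(leader(0, alive))                  # -> host-b; the data itself stays put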


That's incorrect. Data doesn't shuffle on a partition unless you do it manually.


I realize your previous comment was more in reference to transient network partitions, so my comment is out of place. But whether the mechanism is manual or automatic, once a network partition is discovered the reshuffle begins.


It would be very uncommon to perform a token rebalance (or bring up replacement nodes) under a network partition, since those nodes are fine. The idea is to be tolerant to a network partition, which multi-master is, not to then attempt to patch up a whole new network while the partition is in place. That could easily bring down the entire cluster.

Data is only typically shuffled around when a replacement node is introduced to the cluster.

If you have multiple independent network partitions that isolate all of your RF nodes, then there is no database that could function safely in this scenario, and this has nothing to do with data shuffling.


While this is true, I'm not sure why that's a problem. The system can still function while the data is in transit from one node to another. As long as the right 2/3 of the machines are up, the cluster can function at 100%, and as long as the right 1/3 are up, it can still serve reads (assuming a quorum of 3).


You mean replication factor of 3? It doesn't work like that. If your replication factor is 3 and you're using quorum read/writes then as soon as two machines are down some of the reads and writes will fail. The more machines down the higher the probability of failure. That's why you have to start shuffling data around to maintain your availability which is a problem... (EDIT: assuming virtual nodes are used, otherwise it's a little different)
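
The arithmetic behind that, as a rough sketch that ignores vnode placement:

    def quorum(rf):
        return rf // 2 + 1                    # e.g. RF=3 -> quorum of 2

    def quorum_ops_succeed(rf, replicas_down):
        """QUORUM reads/writes for a key succeed only while enough of its replicas are up."""
        return (rf - replicas_down) >= quorum(rf)

    print(quorum_ops_succeed(rf=3, replicas_down=1))   # True: 2 of 3 replicas still up
    print(quorum_ops_succeed(rf=3, replicas_down=2))   # False: only 1 replica left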


Like I said, it depends on how you lay out your data. Let's say you have three data centers, and you lay out your data such that there is one copy in each datacenter (this is how Netflix does it, for example).

You could then lose an entire datacenter (1/3 of the machines) and the cluster will just keep on running with no issues.

You could lose two datacenters (2/3s of the machines) and still serve reads as long as you're using READ ONE (which is what you should be doing most of the time).
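
For concreteness, a sketch of that layout with the DataStax Python driver (the keyspace, table, and datacenter names are made up, and have to match whatever your snitch reports):

    from cassandra.cluster import Cluster
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    session = Cluster(["10.0.0.1"]).connect()

    # One replica of every row in each of three datacenters.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS viewing
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'us_east': 1, 'us_west': 1, 'eu_west': 1}
    """)

    # A read at ONE is answered by any single live replica, so it keeps
    # working even if two of the three datacenters are unreachable.
    read = SimpleStatement(
        "SELECT * FROM viewing.history WHERE user_id = %s",
        consistency_level=ConsistencyLevel.ONE)
    rows = session.execute(read, ("user-123",))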


If you read and write at ONE (which I think Netflix does) then this kind of works. Still, with virtual nodes, losing a single node in each DC leaves you with some portion of the keyspace inaccessible.

You're also susceptible to losing data, since at any given time there will be data that hasn't yet been replicated to another DC, and you're accepting inconsistent reads.

That works for some applications, where the data isn't mission-critical and (immediate) consistency doesn't matter, but not for many others. I'm not sure what exactly Netflix puts in Cassandra, but if it's used to record, say, what people are watching, then losing a few records or looking at a not-fully-consistent view of the data isn't a big deal...


If a machine is down (as in the machines themselves are dead) then you absolutely need to move the data onto another node to avoid losing it entirely. There is no way to avoid this, whatever your consistency model.

If there is a network partition, however, there is no need to move the data since it will eventually recover; moving it would likely make the situation worse. Cassandra never does this, and operators should never ask it to.

If you have severe enough network partitions to isolate all of your nodes from all of their peers, there is no database that can work safely, regardless of data shuffling or consistency model.


It can be hard to tell if the machine is down permanently, as in dead including the disk drives, or if it's temporarily down, unreachable or just busy. Obviously, if you want to maintain the same replication factor, you need to create new copies of any data that has temporarily dropped to a lower replication count. In theory, though, you could still serve reads as long as a single replica survives, and you could serve writes regardless.

You can kind of get this with Cassandra if you write with CL_ANY and read with CL_ONE but hinted handoffs don't work so great (better in 3.0?) and reading with ONE may get you stale data for a long while. It would be nice, and I don't think there's any theory that says it can't be done, if you could keep writing new data into the database at the right replication factor in the presence of a partition and you could also read that data. Obviously accessing data that is solely present on the other side of the partition isn't possible so not much can be done about that...
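
A sketch of that write-at-ANY, read-at-ONE combination with the DataStax Python driver (keyspace and table names are invented), with exactly the caveats above:

    from cassandra.cluster import Cluster
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    session = Cluster(["10.0.0.1"]).connect("metrics")   # hypothetical keyspace

    # ANY: the write is accepted even if no replica for the key is reachable
    # (the coordinator keeps a hint), with the durability caveats noted above.
    write = SimpleStatement(
        "INSERT INTO events (id, payload) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ANY)
    session.execute(write, ("evt-1", "hello"))

    # ONE: any single live replica answers, possibly with stale data.
    read = SimpleStatement(
        "SELECT payload FROM events WHERE id = %s",
        consistency_level=ConsistencyLevel.ONE)
    rows = session.execute(read, ("evt-1",))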


We don't even need to agree or disagree. There are numerous studies on how often network partitions happen. Aphyr and Bailis's paper 'The Network Is Reliable' [1] has a very detailed overview of these studies.

[1] https://aphyr.com/posts/288-the-network-is-reliable


> but what does happen a lot is a slow or delayed connection, or a machine going offline for a few seconds.

This is especially true for cross-datacenter rings across the public internet.


That is potentially a partition. Anything that violates the SLA is an effective partition.


Seriously HN, that is fully on topic and did not deserve a down vote.


I think that's a simplistic analysis. What you want really depends on the workload characteristics. If you want an immutable data store with high write performance but infrequent reads, the Dynamo model works quite well. Writes can be fast with a small quorum (you don't need to persist to disk immediately, as you're relying on multiple machines not going down), and read consistency is not usually a huge issue for analytics tasks (who really cares if you miss an event in your processing?); if it is, you can usually afford to wait for quorum reads.
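
The rule of thumb behind that kind of tuning: with N replicas, reads and writes are guaranteed to overlap on at least one up-to-date replica whenever R + W > N. A tiny illustration, not tied to any particular database:

    def overlapping(n, r, w):
        """With N replicas, every R-replica read intersects every W-replica write iff R + W > N."""
        return r + w > n

    N = 3
    print(overlapping(N, r=2, w=2))   # True:  QUORUM/QUORUM, consistent but slower writes
    print(overlapping(N, r=1, w=1))   # False: ONE/ONE, fast writes, possibly stale reads
    print(overlapping(N, r=3, w=1))   # True:  cheap writes, pay for it with ALL on reads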


True. If you aren't doing interactive workloads you have a much freer hand.


The redundant work has a drastic improvement on tail latency. This is true in a healthy cluster, but even more true once even small failures or fluctuations occur. You might be able to tune your network and your software to _minimize_ observed partitions, but you are very unlikely to see a system free of fluctuations outside of hard-realtime system construction, which none of the above qualify for by a very long shot.

It's a heavy price, sure, but in return you'll be able to round off the tail in most scenarios. This is a trade-off that many don't properly consider: they focus on mean performance and miss part of the point of the Dynamo model, which helps provide better guarantees about how things perform in more cases, including highly tuned clusters with very few major failures.


Reads absolutely must go through the consensus system (whatever it may be) if you make any guarantees against stale reads. Yes it's expensive, but that's why there's no such thing as a magical distributed database that scales linearly while providing perfect consistency guarantees all the time.


Both Dynamo and Cassandra are written in Java. Are the replacements still Java?

At least Google's counterparts, such as BigTable and GFS, seem to be written in C++. Presumably Spanner is C++ as well.

In addition to C++, especially considering its weaker safety/security record, Rust, Golang and Nim should be interesting alternative, safer implementation language choices. In contrast to idiomatic Java, those languages provide a significantly higher CPU cache hit rate for internal data structures, due to no value boxing [1], and the ability to reliably reduce problems like false sharing [2].

1: http://www4.di.uminho.pt/~jls/pdp2013pub.pdf (These issues can be worked around in Java by abandoning object orientation and instead having one object with multiple arrays (SoA, structure of arrays). In other words, not List or Array etc. of Point-objects, but class Points { int[] x; int[] y; ... } that contains all points.)

2: http://mechanical-sympathy.blogspot.com/2011/08/false-sharin... (False sharing performance issues)


In what way are Rust, Golang, and Nim safer than Java? Rust, I understand, provides some nice semantics for thread safety, but these exist in Java as well.

And I don't understand why anybody should care about CPU cache hit rate. The bottleneck is always going to be in the I/O pipeline. And Java is faster than C++ and vice versa in various situations.


Rust, Golang and Nim are safer than C++. Sorry that I didn't express it clearly enough; I considered the safety issues to be rather obvious.

> Rust, I understand, provides some nice semantics for thread safety, but these exist in Java as well.

How do you get Java to fail compiling if thread safety constraints are not met? I'd be interested to try it out!

One should definitely care about cache hit rate, because it significantly affects runtime performance. There are just 512 L1D cache lines per CPU core.

I/O is the bottleneck? It is becoming less so, one of the few areas where there's actually some nice progress happening. PCIe SSDs are up to 1.5 - 2 GB/s (=up to 20 Gbps). More and more servers have 10 Gbps or 40 Gbps networking.

Sure, Java is faster when it can use JIT to prune excessive if-jungle, aggressively simplify and inline and adapt to running CPU. But memory layout control is where Java is rather weak. The problem is getting only worse, because the gap between CPU and memory performance is only widening year by year. Memory bandwidth is increasing slowly and latency hasn't improved for a decade.

C++ is always going to be faster, especially if specialized to a certain machine and use case. C++ is also going to win by a large margin when there's auto-vectorizable code or heavy use of SIMD intrinsics. 10x is not unusual if the problem maps well to the AVX2 instruction set.


That is a good thing. It does not mean that Dynamo and Cassandra are bad choices, but that companies and teams can move on to other solutions when their requirements change. In an open-source model, just because you invented it does not mean you are stuck with it.

Not sure I could work in a one-product company where you have to eat your own dog food in all situations, even those it really is not suited for.


Hulu uses Cassandra heavily, as does Netflix.


Apple has 70k Cassandra nodes; if you send an iMessage, it's probably going into C* at some point.


From what I heard Sony PSN uses Cassandra and EA uses it for many of their online services.

It really is the easiest database by far to scale that I've worked with.


> What's interesting to note is that Dynamo/Cassandra usage was killed at both Amazon and Facebook

IMO, the only way use, or former use, is interesting is in understanding why a specific company moved to, or away from, a given technology.

Specifically, what was the original use case that was the basis for original use?

Why did the company choose to change technology? Did the use case change? Did the technology fail to satisfy the original requirements? Did a new technology with substantially better capabilities emerge? And so on?

Simply stating that company X uses Y (or company X no longer uses Y) does not provide a lot of information that other companies can use as the basis for their decision.


It's not so simple. The main complaint I have heard is that programmers find it difficult to deal with eventual consistency exposed through vector clocks. It's only recently that CRDTs have been reasonably well known, and they solve this problem. Riak 2.0 includes a CRDT library but it might be too late.
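
For anyone unfamiliar with CRDTs, here's a minimal sketch of one of the simplest, a grow-only counter: each node increments only its own slot, and replicas merge with an element-wise max, so there is no conflict left for the application to resolve by hand.

    class GCounter:
        """Grow-only counter CRDT: one count per node, merged by element-wise max."""
        def __init__(self):
            self.counts = {}                          # node_id -> count

        def increment(self, node_id, amount=1):
            self.counts[node_id] = self.counts.get(node_id, 0) + amount

        def merge(self, other):
            for node, count in other.counts.items():
                self.counts[node] = max(self.counts.get(node, 0), count)

        def value(self):
            return sum(self.counts.values())

    a, b = GCounter(), GCounter()
    a.increment("node-a")
    b.increment("node-b"); b.increment("node-b")
    a.merge(b)                                        # merge is commutative and idempotent
    print(a.value())                                  # -> 3, regardless of merge order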

Note that Cassandra doesn't actually handle eventual consistency properly, and has weird corner cases as a result (e.g. its infamous "doomstones"). As an immutable data store it works very well, particularly when you have a high write load.


I found Riak's CRDT implementation disappointing. They require that you define them beforehand in a schema, which defeats much of the point of using a schemaless database in the first place.


What did Amazon move to? I wasn't aware they moved away from a Dynamo model.


Cassandra is a pretty good NoSQL database. I'm a big fan of it.

Of course, if your data changes often and very quickly you shouldn't be using Cassandra. You'll end up in tombstone death. Deletes are not real deletes; they're soft, just marked with a timestamp, and only eventually purged (tombstones). Every delete creates a tombstone, and if your hashkey/column key has 50 tombstones, a read has to go through those tombstones before getting the values. The reason is a trade-off: faster writes, but shitty reads if you update the same key a lot.

Overall, one way of thinking of it is as a souped-up hash-key DB that can do some relational-style queries, which is much more than the usual hash-key NoSQL store. Seeing how it's a hash-key type database, you can see the trade-offs of Cassandra and its siblings versus, say, MongoDB or CouchDB.

I used it for time-related stuff with simple data. It's mostly immutable, so Cassandra was perfect for it. An example is daily TV show release dates.
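
A toy model of the tombstone behaviour described above (nothing like Cassandra's real storage engine, just the read-path intuition):

    partition = []   # append-only cells: (clustering_key, value_or_None, timestamp)
    clock = 0        # toy logical clock standing in for the write timestamp

    def write(key, value):
        global clock
        clock += 1
        partition.append((key, value, clock))

    def delete(key):
        global clock
        clock += 1
        partition.append((key, None, clock))          # a "soft" delete: a tombstone cell

    def read(key):
        # A read has to consider every cell for the key, tombstones included,
        # to find the newest state.
        cells = [c for c in partition if c[0] == key]
        if not cells:
            return None
        latest = max(cells, key=lambda c: c[2])
        return latest[1]                              # None means "deleted"

    for i in range(50):
        write("row-1", i)
        delete("row-1")     # every delete leaves a tombstone the next read wades through
    print(read("row-1"), len(partition))              # -> None 100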


Increasing the size of memtables should reduce the tombstone overhead, since entries will get overwritten in memory.


Sometimes I feel that the Cassandra row length limit is not taken seriously enough. If you have some high-frequency data, it will not take too long until you hit the limit. Also, the guide states that you should not really attempt to get it that far. I'm not sure if schemes like the one in the example are future-proof. In a couple of years developers will start hitting the limit and we'll see mailing lists with postmortems :) . A bit apocalyptic, but I believe we should find an idiomatic solution for it. I use rounded time periods as a part of the partition key. Yes, it makes the data a bit more complicated to query, but I have not thought of a better solution.
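
One common shape for that rounded-time-period partition key (the table and column names here are hypothetical), which caps how wide any single row can grow:

    from datetime import datetime, timezone

    def day_bucket(ts: datetime) -> str:
        """Round the event time down to a day; each (metric, day) pair is its own row."""
        return ts.strftime("%Y-%m-%d")

    # Hypothetical schema: the bucket is part of the partition key, so a single
    # metric never accumulates more than one day's worth of columns in one row.
    CREATE_TABLE = """
    CREATE TABLE readings (
        metric text, day text, ts timestamp, value double,
        PRIMARY KEY ((metric, day), ts)
    )
    """

    print(day_bucket(datetime.now(timezone.utc)))   # e.g. '2015-05-18'
    # Range queries now have to fan out across the buckets they span.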

Time-series databases like KairosDB are quite good, but only for simpler data structures and for things describable as a metric. Also, you may face issues introducing relatively little-known software at your company, and getting locked into it.


Pretty interesting read; the concept of tunable consistency was foreign to me. Keep up the good work!


Riak has the same sort of concept: http://docs.basho.com/riak/latest/theory/concepts/Eventual-C...

I think Riak is richer in this area than Cassandra, because if the inconsistency cannot be resolved, Riak can keep both versions and let you deal with it at the application level.

See also: https://aphyr.com/posts/294-call-me-maybe-cassandra


You may be interested in the Dynamo paper, which is what introduced it to me: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp...


Article isn't loading.



