
I realize your previous comment was more in reference to transient network partitions, so my comment is out of place. But whether the mechanism is manual or automatic, once a network partition is discovered the reshuffle begins.



It would be very uncommon to perform a token rebalance (or bring up replacement nodes) under a network partition, since the partitioned nodes themselves are fine. The idea is to be tolerant of a network partition, which multi-master is, not to attempt to patch up a whole new network while the partition is in place. That could easily bring down the entire cluster.

Data is typically only shuffled around when a replacement node is introduced to the cluster.

If you have multiple independent network partitions that isolate all of your RF nodes from each other, then no database could function safely, and that has nothing to do with data shuffling.


While this is true, I'm not sure why that's a problem. The system can still function while the data is in transit from one node to another. As long as the right 2/3 of the machines are up, the cluster can function at 100%, and as long as the right 1/3 are up, it can still serve reads (assuming a quorum of 3).


You mean a replication factor of 3? It doesn't work like that. If your replication factor is 3 and you're using quorum reads/writes, then as soon as two machines are down some of the reads and writes will fail. The more machines are down, the higher the probability of failure. That's why you have to start shuffling data around to maintain your availability, which is a problem... (EDIT: assuming virtual nodes are used, otherwise it's a little different)
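To make the arithmetic concrete, here's a rough sketch in plain Python (the numbers are purely illustrative):

    # Quorum arithmetic for a single partition with replication factor 3.
    # A QUORUM read or write needs floor(RF / 2) + 1 replicas to respond.
    RF = 3
    quorum = RF // 2 + 1  # = 2

    def quorum_ok(replicas_up):
        """True if a QUORUM request against this partition can succeed."""
        return replicas_up >= quorum

    print(quorum_ok(3))  # True  - all replicas healthy
    print(quorum_ok(2))  # True  - one replica down, quorum still reachable
    print(quorum_ok(1))  # False - two of three replicas down: reads and writes fail

With virtual nodes, losing any two machines is likely to take out two replicas for at least some token ranges, which is where the failed requests come from.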


Like I said, it depends on how you lay out your data. Let's say you have three data centers, and you lay out your data such that there is one copy in each datacenter (this is how Netflix does it, for example).

You could then lose an entire datacenter (1/3 of the machines) and the cluster will just keep on running with no issues.

You could lose two datacenters (2/3 of the machines) and still serve reads as long as you're reading at consistency level ONE (which is what you should be doing most of the time).
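For reference, the one-copy-per-datacenter layout described above roughly corresponds to a NetworkTopologyStrategy keyspace with a replica count of 1 in each DC. A sketch using the DataStax Python driver; the contact point, keyspace name, and datacenter names are made up, and the DC names would have to match whatever your snitch reports:

    from cassandra.cluster import Cluster

    # Hypothetical contact point; in practice you'd list nodes from each DC.
    session = Cluster(['10.0.0.1']).connect()

    # One copy of every row in each of three datacenters (RF = 3 overall).
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS viewing_history
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us_east': 1,
            'us_west': 1,
            'eu_west': 1
        }
    """)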


If you read and write at ONE (which I think Netflix does) then this kind of works. Still, with virtual nodes, losing a single node in each DC leaves some portion of the keyspace inaccessible.

You're susceptible to total loss of data, since at any given time there will be data that hasn't yet been replicated to another DC, and you're accepting that reads may be inconsistent.

That works for some applications where the data isn't mission critical and (immediate) consistency doesn't matter, but not for many others. I'm not sure what exactly Netflix puts in Cassandra, but if it's used, for example, to record what people are watching, then losing a few records or seeing a not-fully-consistent view of the data isn't a big deal...


If a machine is down (as in the machine itself is dead) then you absolutely need to move the data onto another node to avoid losing it entirely. There is no way to avoid this, whatever your consistency model.

If there is a network partition, however, there is no need to move the data since it will eventually recover; moving it would likely make the situation worse. Cassandra never does this, and operators should never ask it to.

If you have network partitions severe enough to isolate all of your nodes from all of their peers, there is no database that can work safely, regardless of data shuffling or consistency model.


It can be hard to tell whether a machine is down as in permanently dead, disk drives included, or whether it's temporarily down, unreachable, or just busy. Obviously, if you want to maintain the same replication factor, you need to create new copies of any data whose replica count has temporarily dropped. In theory, though, you could still serve reads as long as a single replica survives, and you could serve writes regardless.

You can kind of get this with Cassandra if you write with CL_ANY and read with CL_ONE, but hinted handoffs don't work so great (better in 3.0?) and reading with ONE may get you stale data for a long while. It would be nice, and I don't think there's any theory that says it can't be done, if you could keep writing new data into the database at the intended replication factor in the presence of a partition, and could also read that data back. Obviously accessing data that only exists on the other side of the partition isn't possible, so not much can be done about that...
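Roughly what that write-at-ANY / read-at-ONE pattern looks like with the DataStax Python driver (the keyspace, table, and column names are invented for the example):

    from datetime import datetime, timezone

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['10.0.0.1']).connect('viewing_history')

    # Write at ANY: the coordinator can accept the write even if no replica
    # for the key is reachable, storing a hint to hand off later.
    write = SimpleStatement(
        "INSERT INTO watched (user_id, title, ts) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.ANY)
    session.execute(write, ('user-123', 'some-show', datetime.now(timezone.utc)))

    # Read at ONE: any single live replica answers, which keeps reads available
    # during a partition, but that replica may not have seen the hint yet, so
    # the read can be stale until hinted handoff or repair catches up.
    read = SimpleStatement(
        "SELECT title, ts FROM watched WHERE user_id = %s",
        consistency_level=ConsistencyLevel.ONE)
    rows = session.execute(read, ('user-123',))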



