
Monitoring Cassandra at Scale - jolynch
http://engineeringblog.yelp.com/2016/06/monitoring-cassandra-at-scale.html
======
jolynch
A deep dive into how data is stored on Cassandra rings and how to use this
knowledge to give your operators notice before cluster issues become site
issues.

Any feedback is super welcome. If folks like this, we may consider upstreaming
the approach into nodetool.

~~~
welder
Have you seen RethinkDB, and what do you think about it (as a replacement for
Cassandra)?

RethinkDB seems to have everything Cassandra has, without the complexity.

~~~
jolynch
To be honest, most of my exposure to RethinkDB is from Aphyr's article on it.
We already run basically every alternative he mentions as superior for
particular use cases. For example, we run ZooKeeper / replicated SQL for inter-
key consistent actions and Cassandra for an AP store. When we need document
semantics we have a pretty robust Elasticsearch setup. It didn't seem like
RethinkDB would do all of those things better than those special-purpose
databases, so I didn't look into it too much.

Replacing Cassandra at Yelp would be a lot of effort, so we'd have to be sure
that it's worth it. That being said, RethinkDB definitely looks interesting
and I'll make sure it's on my list of datastores to evaluate.

~~~
superchink
Do you have a link to this article from Aphyr?

~~~
jolynch
[https://aphyr.com/posts/329-jepsen-rethinkdb-2-1-5](https://aphyr.com/posts/329-jepsen-rethinkdb-2-1-5)

------
spodkowinski
Thanks for posting this, it's an interesting approach. Monitoring the
availability of replicas instead of individual nodes probably makes sense.
However, I'm wondering how this information is actionable for your team. How
would you act differently if your monitoring reports certain consistency
levels becoming unavailable, compared to just reporting 2 unavailable nodes in
your cluster?

~~~
jolynch
It mostly came down to flexibility. The original form of our monitoring did
basically what you suggest, "one node failure is ok, two is not", but we found
that wasn't good enough. In our experience that approach was (a rough sketch
of the per-keyspace check we moved to follows the list):

1. Noisy. We had a lot of large deployments with high replication factors
(e.g. RF=5 or 7) whose owners very much didn't care if 2 nodes failed, or 3,
or even 4. They had the high replication factor for resilience to multiple
rack failures and didn't want to get paged when a few racks failed.

2. Hard to generalize, especially with multi-tenant clusters. Size of cluster
!= replication of keyspaces. For example, if we had a 50-node cluster but a
keyspace with RF=1, a single node failure should be a pageable event. Why is
there an RF=1 keyspace? Because devops means that developers sometimes do
things like that.

3. Had poor attribution. If you have a large cluster with many keyspaces, one
of which has a lower RF than the rest or a higher consistency level, then only
the owners of those keyspaces care if we lose a node or two. When we're
dealing with an incident we can rope in specifically the teams that own the
under-replicated keyspaces so they can take appropriate action.
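
Very roughly, the per-keyspace check boils down to something like the sketch
below. This is a minimal illustration with made-up node names, not our actual
code; real Cassandra placement is per token range and per datacenter, which
this glosses over:

```python
# Rough sketch: page per keyspace only when its consistency level is at
# risk, rather than paging on any N node failures cluster-wide.

def quorum(rf):
    # Smallest majority of RF replicas, e.g. RF=5 -> 3.
    return rf // 2 + 1

def at_risk_keyspaces(keyspaces, down_nodes):
    """keyspaces maps name -> {"rf": int, "replicas": set of node names}."""
    at_risk = []
    for name, ks in keyspaces.items():
        live = len(ks["replicas"] - down_nodes)
        if live < quorum(ks["rf"]):
            at_risk.append(name)
    return at_risk

keyspaces = {
    "reviews": {"rf": 5, "replicas": {"n1", "n2", "n3", "n4", "n5"}},
    "scratch": {"rf": 1, "replicas": {"n2"}},
}
# Two nodes down: the RF=5 keyspace still has quorum, the RF=1 one pages.
print(at_risk_keyspaces(keyspaces, down_nodes={"n2", "n4"}))  # ['scratch']
```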

To be totally honest, mostly it just helps us find keyspaces that have a low
RF ... The number of times we found out that the new Cassandra version we just
deployed added another system keyspace using SimpleStrategy with a default
replication factor of 2 ...
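
For anyone wanting to hunt those down, a query against the schema tables does
it. A minimal sketch, assuming Cassandra 3.x (which has the system_schema
keyspace) and the DataStax python driver; the contact point is a placeholder:

```python
# Minimal sketch: flag keyspaces using SimpleStrategy with a low RF.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # placeholder contact point
session = cluster.connect()

rows = session.execute(
    "SELECT keyspace_name, replication FROM system_schema.keyspaces")
for row in rows:
    # The replication map holds the strategy class plus its options;
    # SimpleStrategy has a single 'replication_factor' entry, while
    # NetworkTopologyStrategy has one entry per datacenter.
    strategy = row.replication.get("class", "")
    if strategy.endswith("SimpleStrategy"):
        rf = int(row.replication.get("replication_factor", "1"))
        if rf < 3:
            print("suspicious keyspace: %s (RF=%d)" % (row.keyspace_name, rf))
```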

------
dragonne
Thank you for posting this, and in particular for the pointer to Jolokia. Just
what a Python shop needs to monitor Java infrastructure.
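
For reference, Jolokia bridges JMX to plain HTTP, so a Python health check
stays simple. A minimal sketch, assuming the Jolokia JVM agent on its default
port (8778) and Cassandra's storage-load MBean:

```python
# Minimal sketch: read a Cassandra JMX metric over Jolokia's HTTP bridge.
import requests

JOLOKIA = "http://localhost:8778/jolokia"  # assumes the default agent port
MBEAN = "org.apache.cassandra.metrics:type=Storage,name=Load"

# GET /jolokia/read/<mbean> returns the MBean's attributes as JSON; the
# Load counter's value lives in its 'Count' attribute.
resp = requests.get("%s/read/%s" % (JOLOKIA, MBEAN)).json()
print("live data on this node: %d bytes" % resp["value"]["Count"])
```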

