
Cassandra Performance - mfiguiere
http://www.datastax.com/dev/blog/2012-in-review-performance
======
endymi0n
Cassandra is ugly, hardcore, and performant as hell. It's not meant for the
casual user; it's really meant to be there for you at scales where MongoDB
craps its pants. If you wrap your head around ColumnFamilies, tunable
consistency, and NetworkTopologyStrategy/snitch configuration, you get
rewarded with a database that can scale on a global level to millions of I/O
operations per second. We at Trademob have chosen Cassy as the backbone of our
tracking platform and couldn't be happier. It's pretty serious stuff, though,
and nothing for a quick prototype or the first few iterations of a product, IMHO.
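
To make the "tunable" part concrete, here's a minimal sketch, assuming the
DataStax Java driver (2.x-era API) and made-up keyspace/datacenter names, of
declaring per-datacenter replica placement with NetworkTopologyStrategy:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CreateTrackingKeyspace {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            // Three replicas in a (hypothetical) EU datacenter, two in a US one.
            session.execute(
                "CREATE KEYSPACE tracking WITH replication = " +
                "{'class': 'NetworkTopologyStrategy', 'dc_eu': 3, 'dc_us': 2}");
            cluster.close();
        }
    }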

~~~
taligent
Sorry but this is just nonsense.

Cassandra with PlayORM/Astyanax has been the easiest database for me to
install, use and manage out of the 10+ I've tried. Far simpler to
install/manage than MySQL Cluster or Riak, far easier to use than PostgreSQL
and infinitely better to scale than MongoDB.

You don't have to understand ColumnFamilies, consistency, or the different
topology strategies. The defaults are fine, and if you are a Java developer,
life couldn't possibly be simpler.

~~~
3amOpsGuy
If you don't understand the implications of eventual consistency, you're
heading for a fall.

It's not a trivial topic, and unfortunately it "appears to work as you'd
expect" on a small dev cluster, which can lead to statements like yours.

Your parent's post is actually very, very accurate.

~~~
taligent
I am not disputing that Cassandra has a learning curve; I just disagree that
it is any different from every other database available today.

They ALL have issues, and eventual consistency is a fundamental part of
distributed databases, so it's something you have to learn either way.

~~~
rescrv
Check out HyperDex, HBase, and BigTable for systems that provide better
guarantees than "eventually."

~~~
tylerhobbs
You would need to define "consistency" to have a more reasonable discussion
about what each system provides, but Cassandra certainly isn't only eventually
consistent. You can choose, per operation, the number of replicas that must
respond before the read/write is considered a success, which allows you to get
quorum-based strong consistency guarantees.

There are more details on the options here:
<http://www.datastax.com/docs/1.2/dml/data_consistency>
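
For illustration (not taken from those docs), a minimal sketch with the
DataStax Java driver in which each statement carries its own consistency
level; the keyspace, table, and contact point here are hypothetical:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class PerOperationConsistency {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo");

            // The write must be acknowledged by a majority of replicas...
            SimpleStatement write = new SimpleStatement(
                "INSERT INTO users (id, name) VALUES (42, 'alice')");
            write.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(write);

            // ...and the read queries a majority too, so the two sets overlap.
            SimpleStatement read = new SimpleStatement(
                "SELECT name FROM users WHERE id = 42");
            read.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(read);

            cluster.close();
        }
    }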

~~~
rescrv
Consistency is a safety property. HyperDex, HBase, and BigTable all provide
linearizability, which has a well-defined meaning. Cassandra does not, and
most of its descriptions of consistency refer only to the behavior of the
system, not to the properties you can rely upon. Pointing to the number of
replicas read or written only clouds the issue.

~~~
crypto5
I think if you write to Cassandra with consistency level ALL, you get strong
consistency. Or you can use write consistency ONE and read consistency ALL, or
write consistency QUORUM and read consistency QUORUM.
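
The rule of thumb behind all of these combinations: a read is guaranteed to
overlap the most recent acknowledged write whenever R + W > RF (replicas read
plus replicas written exceeds the replication factor). A tiny self-contained
sketch of the arithmetic:

    public class OverlapRule {
        // A read sees the latest acknowledged write iff the replica sets overlap.
        static boolean stronglyConsistent(int readReplicas, int writeReplicas, int rf) {
            return readReplicas + writeReplicas > rf;
        }

        public static void main(String[] args) {
            int rf = 3;              // replication factor
            int quorum = rf / 2 + 1; // 2 when rf = 3
            System.out.println(stronglyConsistent(quorum, quorum, rf)); // QUORUM/QUORUM: true
            System.out.println(stronglyConsistent(3, 1, rf));           // read ALL, write ONE: true
            System.out.println(stronglyConsistent(1, 1, rf));           // ONE/ONE: false, stale reads possible
        }
    }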

------
otterley
> A log-structured engine that avoids overwrites to turn updates into
> sequential i/o is essential both on hard disks (HDD) and solid-state disks
> (SSD). On HDD, because the seek penalty is so high; on SSD, to avoid write
> amplification and disk failure. This is why you see mongodb performance go
> through the floor as the dataset size exceeds RAM.

The structure of MongoDB's on-disk data has nothing to do with why its
performance starts to falter when the dataset size exceeds RAM. It falters
because each node mmap(2)s its dataset into MongoDB's process space and relies
solely on the kernel's buffer-caching algorithm to determine which pages to
cache. The buffer cache is general-purpose, shared with every other process
running on the node, and isn't finely tuned (or tunable, for that matter) for
database workloads, for which a basic LRU is too naive. This is why MySQL, for
example, doesn't mmap its tablespaces - instead, it's typically configured to
manage its own buffer pool, using O_DIRECT semantics for disk I/O to avoid
double buffering.
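
For a flavor of the mmap approach being criticized, a minimal Java sketch
(the filename is hypothetical): the application simply dereferences the
mapping, and the kernel alone decides which pages stay resident.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MmapRead {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");
                 FileChannel ch = raf.getChannel()) {
                // Map the whole file; no application-level cache is involved.
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                // This get() is either a cheap page-cache hit or a blocking
                // page fault to disk; the caller cannot tell in advance.
                System.out.println(map.get(0));
            }
        }
    }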

~~~
rogerbinns
There is a secondary problem with using mmap: you don't know whether an access
will take a page fault. When thrashing starts, MongoDB doesn't throttle new
queries. New queries coming in then add fuel to the fire, making existing
queries take longer and longer. This causes a huge and rapid performance
dropoff.

Of course, not all new queries will cause paging, so those could be left
unthrottled. There is a system call, mincore(2), that will tell you whether
pages are resident, but it doesn't support scatter/gather and has race
conditions, especially when there is lots of paging!

I reported this at the beginning of 2010; it's currently marked as major
priority, planned but not scheduled:
<https://jira.mongodb.org/browse/SERVER-574>

That said, MongoDB is still my first database of choice. Nothing beats
arbitrary JSON in, the same JSON back out.

------
linuxhansl
HBase looks bad in some of these benchmarks because it is hard to set up and
has many knobs that must be tuned correctly for the workload in question.

Due to its strictly consistent nature you have to think about key design,
hotspotting of servers, etc. In return you get correct atomic operations, row
transactions, range scans by default (Cassandra uses a random partitioner by
default, which does not allow range scans), etc.

Some of the largest installations on this planet run on HBase. For example,
Facebook's HBase stats at HBaseCon (May 2012): billions of messages/day, 75
billion ops/day, 1.5M ops/sec peak, 250TB of new data/month and growing.
(Facebook also created Cassandra, but is not using it.)

As usual, you use the right tool for the job, and isolated benchmarks usually
do not capture this.

~~~
jbellis
It's worth noting that the FB HBase install is also sharded across multiple
sub-clusters because of the HDFS namenode SPOF problems [1].

Personally, if I'm going to shard manually, I'll stick with PostgreSQL. One of
the primary reasons to use something like Cassandra is that it solves that for
you.

[1] <http://www.slideshare.net/brizzzdotcom/facebook-messages-hbase/23>

~~~
justin_hancock
My understanding is that Facebook's pod architecture for HBase was not about
the namenode but simply about scaling HBase; HBase gets rather unpleasant at
Facebook scale. The Facebook HBase fork has things like compactions disabled
to improve performance.

I ran an HBase cluster with 1PB of storage, and it became very unwieldy at
this scale: thousands of regions and lots of tricks to keep it happy. As for
the SPOF, the namenode now has HA and it works very well.

~~~
linuxhansl
Interesting. Do you remember what kind of problems you ran into and what
version of HBase you used?

~~~
justin_hancock
HBase 0.90.4. The problems were with I/O: we had a very heavy read load on top
of a write load, with writes bursting to 14,000 TX per second and averaging
8,000 per second, each record around 2k.

Because of the I/O, the WAL had to be turned off, which introduced problems
when region servers occasionally died. We used large regions (10GB) and fairly
large HBase blocks (512MB), increased flush sizes to reduce minor compactions,
used MSLAB to virtually eliminate GC altogether, and ran a large 12GB heap on
each region server.

The worst problem we experienced was META corruption; that really, really
sucked.

~~~
linuxhansl
Thanks. If there's a more detailed writeup you can point me to, that'd be
great. I would like to make sure that all these issues are addressed in the
current versions.

0.94+ has MSLAB enabled by default, and with HFileV2 (0.92+) we can support
much larger regions (20GB or bigger). Curious about the 512MB blocks: did you
have a scan-heavy read load?

14k TX peak per region server? Times 2k per record, that's 28MB/s (56MB/s with
the WAL). That should be doable now even with the WAL on (definitely with
deferred flush). Well, maybe not with a concurrent very heavy read load,
depending on the disk configuration.

Probably on top of Hadoop 0.20-append? Hadoop 2.x should be far better too.

------
ddlatham
The benchmark referenced in this post was previously discussed here:
<http://news.ycombinator.com/item?id=4453500>

------
ogrisel
Has anybody found / done a benchmark comparing the scalability of a Cassandra
cluster vs. an ElasticSearch cluster, the latter used as a NoSQL database
(with stored fields)?

I am interested in two kinds of scalability:

- volume scalability with a single concurrent user: average read/write query
times vs. stored-data and index size vs. number of EC2 nodes

- concurrency scalability with a fixed-size database: average read/write query
times vs. number of concurrent users vs. number of EC2 nodes

------
mnutt
While Cassandra has some nice characteristics, there are a few things I've run
into along the way.

Don't expect to run a 3-node Cassandra cluster and get much out of it in terms
of availability, the way you might run a master/slave failover setup. It's
somewhat obvious, but your Cassandra deployment can't just start with a couple
of nodes and scale up as you run into bottlenecks. The number of nodes needed
starts to add up quickly with a replication factor > 1 and quorum reads. And
while you might say "I'm OK with eventual consistency, let's just read from a
single node," if you're not reading from multiple nodes, the data may _never_
become consistent, from what I can tell.

And counters should be marked with a big warning: "not for production use".
Their performance isn't great, and it nosedives as the dataset grows (each
counter update involves a read plus a write). A node reboot can sometimes
cause counters to double. They seem basically like an afterthought.

~~~
jbellis
Post author here.

Your first paragraph is, bluntly, incorrect. Cassandra guarantees that data
will always become consistent. This is automatic [1] for normal operation,
including in the face of temporary failures. Permanent failures require
running a "repair" process to rebuild the failed machine from other replicas
[2].

I think you've also misunderstood how quorum works: it is a quorum of the
_replica count_, which tends to stay constant over the cluster's lifetime, not
of the _machine count_. With a replication factor of 3, for example, a quorum
is 2 replicas whether the cluster has 3 nodes or 300.

You are right that the current counters are an afterthought. In my concluding
paragraph, where I talk about improvements for Cassandra, I linked to "A new
design for distributed counters" [3].

[1] <http://www.datastax.com/dev/blog/modern-hinted-handoff>
[2] <http://www.datastax.com/docs/1.2/operations/node_repair>
[3] <https://issues.apache.org/jira/browse/CASSANDRA-4775>

~~~
jmix
False -- if nodes are being added to or removed from the system, Cassandra
provides no guarantee of consistency. Two nodes might disagree on quorum
membership, and thus quorum accesses may fail to overlap, leading to
inconsistency.

The consistency claims are overblown.

~~~
parasubvert
Please enlighten us with a clustered database that guarantees consistency
under dynamic node membership. They all have quirks handling membership
(unless you're looking at a shared-disk setup).

~~~
rescrv
Check out HyperDex. We just released a new version, and it is indeed
consistent as nodes join and fail.

------
sturadnidge
Not trying to take anything away from Cassandra (or any of the other products
mentioned), but I would have liked to see the article focus on the actual data
presented rather than on a somewhat speculative discussion of two products
that were not evaluated in the referenced study.

Unless I'm missing something.

------
pothibo
I'm not sure what the point is of comparing benchmarks this way.

Choosing a database is not only about performance; it's about the type of
application you are building and the stage it's in (a prototype doesn't have
the same needs as a product that has grown over 5 years).

It's also about the people who work on the project. Some projects are better
handled in a specific language (Ruby/Java/PHP/ASP.NET, etc.).

For example, using MongoDB on a Ruby stack to build a prototype is a pretty
good choice. Moving some load off Mongo to Redis could be a solution later on.
And eventually, the need might arise to migrate your MongoDB stack to
Cassandra.

~~~
abolibibelot
I'm not sure switching from a document model with multiple indexes (MongoDB)
to a key/value store (Redis) is something that can be done easily "later on".

~~~
pothibo
Well structure evolves over time. Probably the structure would move and you
would use redis as a memcache layer (to update counts, notifications, etc).
What I'm saying is that databases needs evolves over time.

Comparing in-memory storage with SQL and NoSQL isn't a useful and misses the
point.

------
chetanahuja
Well, it's a Cassandra company, so it's bound to play up the throughput test
that shows Cassandra winning. But relegating its huge weakness in latency
(whoa, 10ms average read latency from an in-memory store!) to an "areas for
improvement" list at the bottom is a bit disingenuous. It's not a small issue;
it's an order of magnitude worse than Voldemort, Redis, and even MySQL at
scale.

------
MichaelGG
The big caveat in their usage of VoltDB is that they apparently used a
synchronous client, waiting for a response each time, instead of async
streaming. They mention this briefly in passing at the bottom of the paper,
and say the VoltDB people were able to get a performance increase by using an
async client.
------
aoprisan
Where's the MongoDB comparison? They mention it, but I don't see it in their
graph.

~~~
corresation
Aside from MongoDB not being part of the study in question, it's also worth
mentioning that they cherry-picked the example that made Cassandra look
particularly good. MySQL actually did extremely well on the non-scan tests,
while offering consistency. It depends on your usage.

~~~
rescrv
When benchmarking HyperDex, we found something similar: Cassandra is better at
writes than reads by a surprising amount. Given Facebook's elaborate caching
architecture (which likely pushed for such an inversion), this makes sense.

------
ddorian43
Poor Hypertable is never included in NoSQL benchmarks.

------
meh01
Serious question: are people really still using Cassandra?

I've only ever heard horror stories about big deployments, and the only posts
about it come from DataStax.

~~~
henrikschroder
Netflix is probably the most well-known large user currently.

~~~
meh01
I don't know if this is heretical to say, but when I think about services that
people should look up to in terms of architecture, I don't think of Netflix.

See all the downtime they have despite the 1000 posts on their blog about how
wonderfully available their architecture is.

I can point to 10 other sites running on a boring LAMP stack with similar
availability.

~~~
tadfisher
And if you examine the causes of Netflix' downtime, is it because of their
usage of Cassandra? Would a LAMP service running on the same AWS ELB nodes
have avoided said downtime?

------
dschiptsov
What is the memory overhead? How much memory do the Java processes consume
compared to the amount of data a node can handle, assuming there _must be no
swap_ (otherwise we all know what happens to _any_ Java process)?

~~~
chetanahuja
That question seems to have almost completely fallen off the radar today. I've
seen medium-sized Voldemort clusters eat up huge amounts of extra RAM (on the
order of 100% overhead [1]) to avoid falling into pathological GC patterns
over long runs. Actually, I shouldn't really single out Voldemort; the problem
is Java.

Java [2] is a terrible platform for writing large in-memory caching servers.
The write and access patterns are a complete mismatch for the assumptions made
in the generational GC algorithms that most current JVMs sport. Most caches
evict on an LRU basis, which means that almost all allocations end up in the
old-generation heap before finally being evicted. That is precisely the
counter-optimal case for the basic assumption the generational GC model relies
on: that most objects are short-lived and get swept while still in the "young"
generation, where collection is ultra cheap.

Footnotes:
[1] "Overhead" here means precisely what the parent post defines it to mean.
[2] More precisely, the commonly used, freely available JVMs that most shops
use. There might be better GC implementations (e.g., as claimed by Azul), but
I don't have any direct experience with them.

~~~
henrikschroder
We got bitten by GC issues with our Cassandra cluster, and we had to
completely redesign a column family to fix it.

It's pretty telling that the development community is moving as many memory
structures as possible off the Java heap; each new major release has moved
some piece or other.

The biggest threat I see to Cassandra is that Java in the end won't cut it,
that the JVM will limit its performance too much, allowing a competitor to
surpass it. Stop-the-world GC pauses are not something you want in a high-
performance database.
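
As a sketch of the off-heap direction (not Cassandra's actual implementation):
payload bytes live in a direct ByteBuffer outside the GC'd heap, so the
collector never has to trace or copy them; only small, short-lived wrappers
touch the heap.

    import java.nio.ByteBuffer;

    public class OffHeapSlab {
        private final ByteBuffer slab; // allocated outside the GC'd heap

        public OffHeapSlab(int capacityBytes) {
            this.slab = ByteBuffer.allocateDirect(capacityBytes);
        }

        // Copy a value into the slab at a fixed offset chosen by the caller.
        public void put(int offset, byte[] value) {
            ByteBuffer view = slab.duplicate(); // independent position/limit
            view.position(offset);
            view.put(value);
        }

        // Copy a value back out; only this short-lived byte[] touches the heap.
        public byte[] get(int offset, int length) {
            byte[] out = new byte[length];
            ByteBuffer view = slab.duplicate();
            view.position(offset);
            view.get(out);
            return out;
        }
    }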

~~~
chetanahuja
I was in a discussion with a member of the Go development team, bitterly
complaining about their decision to make a GC'd heap the only way to access
memory in something they intended as a "systems programming language". They
suggested I link in C data structures for those heap-heavy caching
applications :-( As I see it, C and C++ are the only practical options for
writing high-performance, memory-efficient, cache-heavy applications for
production use in the current tech climate.

