

NoSQL databases benchmark: Cassandra, HBase, MongoDB, Riak - teoruiz
http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html

======
karterk
I was part of a large migration that moved a significant amount of data from
MongoDB to HBase recently. Along the way, I also spent significant time
digging into Cassandra and Riak.

I appreciate the effort and intention behind this article, but for all
practical purposes, such numbers are really not helpful. From experience, if
you really want performance from HBase, you need to spend significant amount
of time coming up with the right way to structure your data in HBase. To name
a few:

* choosing the right row key that optimizes bulk scans

* setting optimum client and server caching based on the size of each row

* pre-splitting regions, and setting custom region sizes

You will also run into various cluster-related issues. Things don't really
scale linearly as you add more nodes. You need to also consider maintenance,
upgrades, backups, replication and so on.

If you want to choose a NoSQL (or for that matter any database), spend some
time thinking about whether it fits your data model and your own understanding
of the technology. Performance is rarely gained by simply switching a few
knobs.

------
jbellis
Note that the HBase numbers are so good because the client wasn't actually
hitting the database: [http://brianoneill.blogspot.com/2012/10/solid-nosql-
benchmar...](http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmarks-
from-ycsb-w-side.html)

~~~
knightni
Just came here to post this.

It's frustrating, because a configuration like that makes the numbers
completely irrelevant, unless you don't care whether your data gets persisted
or not. I'd much rather see a more like-for-like comparison, and perhaps an
unsafe configuration included for interest's sake.

~~~
linuxhansl
Nobody in their right mind hits a key value store for each single key value
(an RPC for each key value). You'd be limited by the network latency.

You batch your changes in the client and then you commit them to the server.
When that commit succeeds your data is persisted; until such time any client
application must consider the data uncommitted. This is not unlike a client
application would deal with transactions to a relational database. So the
benchmark is completely valid here.

(Well, only that here we also have deferred log flush, which means the changes
will take a second or two to actually make it to the file system, but that
still demands the same aggregate throughput from the HBase. That is where this
test is not fair).

Edit: Spelling

~~~
knightni
> (Well, only that here we also have deferred log flush, which means the
> changes will take a second or two to actually make it to the file system,
> but that still demands the same aggregate throughput from the HBase. That is
> where this test is not fair).

That was actually the bit that I was concerned about - I failed to read the
parent comment properly and assumed it had the exact same concern I did. If
you don't have to do a disk write because you're not guaranteeing durability,
you have a completely unfair comparison.

------
jegoodwin3
Not sure I buy the paper's analysis. They don't seem to know much about load
testing -- they speak as though throughput is 'load' (rather than the
multiprogramming level) and don't computer Bandwidth - Delay products.

Also, rather than using M/M/1 or some reasonable analytic model, they
deliberately trottled their request rates to hold throughput constant (thereby
guaranteeing different loadings for different 'benchmarks')

Just reading the first graph, for example, and applying Little's Law, it's
pretty evident that Cassandra was loaded more heavily than than the two MySQL
systems, with HBASE and Riak trailing.

Looks like HBASE and Cassandra lead the pack to me, with different
characterists for different purposes.

Advice to authors: buy a book by Neil Gunther.

------
jbellis
A more rigorous set of tests (including datasets that don't fit in memory, for
instance) was presented at VLDB this year:
<http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf>

~~~
Mysterion
Note that this paper was discussed at the time on HN:

<http://news.ycombinator.com/item?id=4453500>

Some people were skeptical of the tuning for HBase, and an HBase developer
criticized their methodology, given the especially poor performance shown for
HBase. The paper authors themselves indicated that they had considerable
trouble setting up the HBase system. Which in a way validates the comments of
some here that HBase is fickle and can be difficult to tune, though it still
means that the results in this paper for HBase are not valid, when compared to
the other systems.

In the interest of disclosure, it should be mentioned here that jbellis is a
principle corporate backer of Cassandra.

------
HarrisonFisk
I wonder why they picked MyISAM for MySQL, since basically no one sane is
using that anymore. It isn't even the default anymore, so that can't be the
excuse.

"MyISAM caches index blocks but not data blocks."

~~~
elchief
MyISAM is fast at reads. You can use innodb for the transactional, writeable
master, and replicate to MyISAM for read-only slaves. This also gives you
full-text search.

~~~
HarrisonFisk
Then your slaves will lag really badly while the replication thread keeps
stalling on table level locks.

InnoDB is faster at reads for most workloads on modern hardware. MyISAM has
horrible scalability across multiple CPUs. MariaDB recently improved in MyISAM
with the segmented key cache, but I haven't seen any results of head-to-head
benchmarks since then.

------
sbierwagen
Running benchmarks dependent on I/O speed on AWS, where I/O speed fluctuates
wildly depending on who you're sharing the hardware with?

Hmm.

------
mikkom
What I find most interesting is how bad mongo performance is.

~~~
orthecreedence
Actually, looks like it does pretty well on reads. It's the writes that mongo
can't handle. But we all knew that already...

------
newman314
The article tries not to draw a conclusion but Cassandra's numbers look pretty
good to me...

------
falcolas
A note for people doing benchmarks - please post your configurations for the
services with your post as well - it would be even better if your tests were
posted on github or something similar so others can reproduce your tests and
validate/update them.

Also, unless you're running all of the tests concurrently, we really need to
see your IOPs records during the tests, since a single lagging EBS volume in a
Raid-0 array will still negatively affect disk performance (either that or
spring for dedicated iops), and thus skew your benchmarks.

Making sure you're not getting a spike CPU steal time would be great as well;
same reason.

------
dschiptsov
Well, the data never leave JVM, that is why it is so "fast" - it never fsyncs.

What if JVM instance crash under load? Data lose, but, see, it is not our
fault - our code is OK.

~~~
crypto5
Not exactly. Cassandra writes commit log, and if JVM will crash, cassandra
will repair data from that log.

~~~
dschiptsov
how often it fsyncs that log?

~~~
crypto5
It can be configured: <http://wiki.apache.org/cassandra/Durability> One option
is to fsync changes before telling client that operation is succeeded.

~~~
dschiptsov
And get what benchmarks?)

~~~
crypto5
Benchmark will be not so good, but do you have better choice?

------
opendomain
How would you compare SQL benchmarks? Oracle, mySQL, SQL Server, Postgresql?
You can find any specific use case where one will out-perform the others. A
lot of DBAs assume the Oracle is the most powerful, but it is also harder to
manage and VERY expensive to run. I know that most NoSQL databases are free,
except for support or multiple datacenter usage- but what about the cost for
DevOps? Or backup?

~~~
taligent
Almost all NoSQL databases are open source and free for multiple datacenter
usage.

And they are all just as easy to manage and backup as everything else. They
all have their problems and issues.

~~~
bonzoesc
How do I back up a 20-node Riak cluster as easily as a Postgres cluster of any
size?

Disclosure: I work for Basho, makers of Riak.

------
xoail
I guess the conclusion pretty much sums it up. Every db solution has its own
advantages and disadvantages. I find many people jump on latest bandwagon and
later regret the choice followed by a "Why we moved away from xyz database"
post on their blog. Please make sure to analyze your application and future
strategy fully before choosing one.

------
sh_vipin
Thanks for this post. It was really good comparison.

------
neya
Thank you for posting this. I actually spent a considerable amount of time
searching for similar benchmarks a few weeks ago.

------
bborud
"After some of the results had been presented to the public, some observers
said MongoDB should not be compared to other NoSQL databases because it is
more targeted at working with memory directly. We certainly understand this,
but the aim of this investigation is to determine the best use cases for
different NoSQL products. Therefore, the databases were tested under the same
conditions, regardless of their specifics."

Oh, great.

------
nirvana
It's a shame they didn't test CouchBase 2.0 (or at least 1.8)... and really
kinda silly, given that it is pretty widely used commercially. I think
CouchBase may be the most successful NoSQL database, when it comes to
commercial installations (for large customers, maybe mongodb has more total
customers.)

Plus, I think it would have scored very well here.

~~~
ericcholis
I was curious about CouchBase myself. I know it's hard to run tests on every
variation of the NoSQL fad, but CouchBase (IMHO) is a pretty sizable player.

