
Cassandra vs MongoDB vs CouchDB vs Redis vs Riak comparison - kkovacs
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
======
antirez
I like this article: while it is for sure not the definitive guide to NoSQL,
it is a short description mostly about facts that people new to the field can
use to get an idea about what a good candidate could be for initial
experimentation, given a defined problem to solve.

That said, I think that picking the right database is something you can do
only with a lot of work. Picking good technologies for your project is _hard
work_, so you have to try one, then another, and so forth, and even reconsider
the state of things after a few years (or months?), given how fast the DB
landscape has evolved in recent years.

While I'm at it, I'd like to share that these very days I'm working on a Redis
disk back end. I already have a prototype working after a few days of full
immersion (I like to use vacation time to work on completely new ideas for
Redis).

The idea is that everything is stored on disk, in what is a plain key-value
database (complex values are serialized when on disk), and the memory is
instead used as an object cache. It is like taking the current Redis Virtual
Memory and inverting the logic completely. The result is the same (working set
in memory, the rest on disk), but this implementation means that there are no
limits on the data you can put into a single instance, that you don't have
slow restarts (data is not loaded into memory until it is requested), and that
there is no need to fork() to save. Keys marked as "dirty" (modified) are
transferred to disk asynchronously as needed, by IO threads.
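
A minimal sketch of that inverted design, assuming a plain dict stands in for
the on-disk key-value database and JSON stands in for the serialization. The
class and method names are hypothetical and are not taken from Redis; the
flush is a plain method call here, where the design described above would use
IO threads.

```python
import json
from collections import OrderedDict


class DiskBackedStore:
    """Toy store: 'disk' holds everything, memory is only an LRU object cache."""

    def __init__(self, cache_size=2):
        self.disk = {}              # stands in for the on-disk key-value DB
        self.cache = OrderedDict()  # in-memory object cache, oldest entry first
        self.dirty = set()          # keys modified since their last write to disk
        self.cache_size = cache_size

    def get(self, key):
        if key not in self.cache:
            if key not in self.disk:
                return None
            # Loaded on demand: nothing has to be read at startup.
            self._admit(key, json.loads(self.disk[key]))
        self.cache.move_to_end(key)
        return self.cache[key]

    def set(self, key, value):
        self._admit(key, value)
        self.dirty.add(key)

    def flush(self):
        """Write every dirty key to disk (synchronously in this sketch)."""
        for key in list(self.dirty):
            self.disk[key] = json.dumps(self.cache[key])
        self.dirty.clear()

    def _admit(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.cache_size:
            victim = next(iter(self.cache))  # least recently used key
            if victim in self.dirty:
                # A dirty value must reach disk before it can be evicted.
                self.disk[victim] = json.dumps(self.cache[victim])
                self.dirty.discard(victim)
            del self.cache[victim]
```

With `cache_size=2`, setting a third key pushes the least recently used one to
disk, and a later `get` on it reloads it into the cache.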

If everything works as I expect (and initial tests are really encouraging)
this means that Redis 2.4 will be out in a few months, completely killing the
current Virtual Memory implementation in favor of the new "two back ends"
design, where you can select if you want to run an in-memory DB or an on-disk
DB where memory is just an LRU cache for the working set.

~~~
kkovacs
antirez, it's an honor that you commented, thanks! :)

The new inverted logic for the VM you describe seems very interesting; I'm
very much looking forward to seeing 2.4!

Redis is already more than perfect for what we use it for -- keeping track of
stock price data, and distributing it. The size of the DB is known in advance
(the number of stocks does not grow very fast), and the performance is
perfect.

Keep up the good work! (And have a nice new year)

Kristof

~~~
antirez
Thank you kkovacs!

I think the main business of Redis is still as an in-memory DB / cache /
messaging system and so forth. We have a decent implementation from this point
of view, so the next logical step is making it work in a cluster.

On the other hand, it's really interesting to see what people can do with the
Redis data model if much larger datasets can be used without problems (at the
cost of performance, of course... it can't be as fast as memory). VM was my
first idea, but I have to admit, I don't like the design at this point. This
new design can be much better, and we can have it production ready in a few
months. So I'm curious about what will happen in 2011! :)

Thank you and have a nice new year as well.

~~~
kkovacs
Thanks :)

------
ghshephard
Worth adding HBase?

Much of the below is stolen from their overview page (all needs to be confirmed):
<http://hbase.apache.org/>

WRITTEN IN: Java

MAIN POINT: Hadoop Database

LICENSE: Apache

PROTOCOL: A REST-ful Web service gateway

This project's goal is the hosting of very large tables -- billions of rows X
millions of columns -- atop clusters of commodity hardware.

HBase is an open-source, distributed, versioned, column-oriented store modeled
after Google's Bigtable: A Distributed Storage System for Structured Data by
Chang et al. Just as Bigtable leverages the distributed data storage provided
by the Google File System, HBase provides Bigtable-like capabilities on top of
Hadoop. HBase includes:

Convenient base classes for backing Hadoop MapReduce jobs with HBase tables

Query predicate push down via server side scan and get filters

Optimizations for real time queries

A high performance Thrift gateway

A REST-ful Web service gateway that supports XML, Protobuf, and binary data
encoding options

Cascading, hive, and pig source and sink modules

Extensible jruby-based (JIRB) shell

Support for exporting metrics via the Hadoop metrics subsystem to files or
Ganglia; or via JMX

HBase 0.20 has greatly improved on its predecessors:

No HBase single point of failure

Rolling restart for configuration changes and minor upgrades

Random access performance on par with open source relational databases such as
MySQL

FOR EXAMPLE: Facebook Messaging Database

BEST USE: Use it when you need random, realtime read/write access to your Big
Data.

~~~
kkovacs
Thanks, Gordon, I put it in (shortened some lines).

It would be great to have a "more general" for-example, since no one outside
Facebook meets the problem of "let's build Facebook's messaging database" :)
Any suggestions?

~~~
sigzero
There might be something you can glean from:

<http://wiki.apache.org/hadoop/Hbase/PoweredBy>

------
littleidea
Apples vs Oranges vs Strawberries vs Pineapple vs Grapes

Apples usually stay crispy unless baked. Good in pies.

Oranges can be sour (or sweet). Do not bake.

Strawberries are red. Good in pies, advise against baking.

Pineapples are rough on the outside. Good fresh, baked, grilled, fried,
debatable on pizza.

Grapes come in many colors and sizes. Great fresh or turned into alcoholic
beverages.

(Not the worst introduction to fruit, but perhaps superficial? Amirite?)

~~~
nkurz
Actually, this would be useful and not superficial to someone who has seen
pictures of these fruits but has never seen one in real life. Obviously it can
be refined, but the idea is not bad. For example, try this with some quite
different but seemingly similar fruits that most people are not as familiar
with:

Pineapple Guava (Acca sellowiana) --- Small green fruit. Seeds soft and edible,
skin optional. Turpentine flavor signals overripe. Cold hardy and grown in
many parts of the US as an evergreen ornamental. Delicious eaten raw.

Strawberry Guava (Psidium cattleianum) --- Tasty small soft red fruit with
very fragrant aroma and many small hard seeds. Skin edible, but seeds best
avoided. Can be eaten out of hand, but low commercial use. Frost tolerant in
mild climates.

White Guava (Psidium guajava) --- True tropical guava, thus barely if at all
frost-tolerant. Large fragrant fruit with inedible hard seeds. Usually used
for juice or puree, rarely eaten out of hand. Wonderful strong aroma increases
with ripeness.

While obviously not of use to a producer of guavas, this sort of cheat sheet
might be helpful to someone who happens to encounter one of these varieties in
a grocery store or tree nursery. At the least, it might keep someone from
breaking their teeth on the inedible hard seeds!

~~~
kkovacs
Wow, thanks, very nice example! (And also thanks for expanding my fruit
knowledge :) )

------
benblack
This article is mostly marketing phrases from the websites of the various
projects. Sadly, much of it is inaccurate, extremely skewed, or otherwise not
useful for the stated purpose of comparing the listed databases.

For example, CouchDB having a "Main Point" of "DB consistency" might be the
case, as it is for Redis, when there is no replication. In replicated
configurations, it is definitely not true. Further, the MVCC is weaker in many
ways than in a Dynamo system like Riak as you have no way to influence or
discover consistency between replicas.

I'm sure folks expert in other systems can identify similar errors in the rest
of the post. Can someone explain to me who the target audience is for all
these NoSQL comparison articles? They are universally poor, yet universally
popular.

------
arethuza
My understanding is that in CouchDB you can't guarantee that older versions of
documents will still exist (they might be there, but they could have been
removed by compaction or not replicated).

However, there is a fairly nice way of storing older versions of documents -
hold older versions as file attachments on the document. See:

[http://jchrisa.net/drl/_design/sofa/_list/post/post-
page?sta...](http://jchrisa.net/drl/_design/sofa/_list/post/post-
page?startkey=\[%22Versioning-docs-in-CouchDB%22\])

~~~
iamwil
This is probably the biggest misunderstanding of couchdb, imo. The versioning
system in couchdb is only there to make the seamless replication possible.
Unlike in git, there's no guarantee that previous versions will exist at a
future time.

Where couchdb has some immense possibilities is in distributed applications,
not only server side, but also mobile phones and browsers. Since you can write
and contain an entire webapp inside of couchdb, you can technically replicate
the entire app to your mobile phone, and it'd work offline or online. And if
you need your app on another platform--as long as it has couchdb, you can just
replicate it there.

I never see this mentioned in any overviews or comparisons of couchdb.

The sticking point right now, though, is that couchdb isn't on very many
mobile platforms. There have only been experiments with writing couchdb on top
of HTML5's localStorage, and jChris et al are working on Couchdb for android.

~~~
bitdiddle
"The versioning system in couchdb is only there.."

CouchDB does not version, period.

~~~
Periodic
The "versioning" is really just there to support their optimistic concurrency
model, if I recall. The idea is that you know you need to retry your operation
if the version hash of the file has gone up since you last read the data and
thus you know your local file is out of date.
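
A minimal sketch of that retry loop, assuming an in-memory store with plain
integer revisions. `RevisionedStore`, `Conflict`, and `update_with_retry` are
hypothetical names for illustration; this is not CouchDB's HTTP API, and it
does not reproduce CouchDB's actual revision format.

```python
class Conflict(Exception):
    """Raised when a write is based on a revision that is no longer current."""


class RevisionedStore:
    def __init__(self):
        self.docs = {}  # doc_id -> (revision number, body)

    def get(self, doc_id):
        rev, body = self.docs[doc_id]
        return rev, dict(body)  # hand back a copy so callers edit locally

    def put(self, doc_id, body, expected_rev=None):
        current = self.docs.get(doc_id)
        current_rev = current[0] if current else None
        if current_rev != expected_rev:
            # Someone else wrote since the caller last read: reject the write.
            raise Conflict(f"{doc_id}: expected {expected_rev}, found {current_rev}")
        new_rev = (current_rev or 0) + 1
        self.docs[doc_id] = (new_rev, dict(body))
        return new_rev


def update_with_retry(store, doc_id, change, max_attempts=5):
    """Read, apply `change` to the local copy, write back; re-read on conflict."""
    for _ in range(max_attempts):
        rev, body = store.get(doc_id)
        change(body)
        try:
            return store.put(doc_id, body, expected_rev=rev)
        except Conflict:
            continue  # local copy was stale, so fetch the latest and try again
    raise Conflict(f"{doc_id}: gave up after {max_attempts} attempts")
```

The revision here exists only to detect stale writes; the store keeps no
history of earlier bodies.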

As I recall, the id field is just a string. It's just common to let it do the
automatic "#-hash" representation.

It's been a while since I played with CouchDB though, so I could be off.

~~~
dhimes
You are correct.

------
ocharles
> While SQL databases are insanely useful tools, their tyranny of ~15 years is
> coming to an end

This shit, AGAIN? Really? No, they are not.

~~~
va_coder
With AppEngine at Google, MongoDB at Disqus, Cassandra at Facebook and Redis
at Github you can definitely say that SQL databases are one of many options
available today and don't dominate like they did 5 years ago.

~~~
dennyabraham
if i'm not mistaken, the majority of those organizations still rely heavily on
relational datastores, except in the case of exceptional workloads. in
addition, i believe facebook has since migrated away from cassandra to the
hadoop stack for their messaging platform, though they primarily use mysql (or
its successors).

SQL is being replaced in niches that strain its model. elsewhere, it remains
steadfast.

~~~
kkovacs
I agree, ultimately it comes down to everyone's own definition of "tyranny".
:) I meant it as "not really having any defensible other choice". (While, of
course, CDB and BerkeleyDB, for example, have "always" existed.)

I think nobody expects SQL's "market share" to fall to low levels, especially
with NoSQL requiring a much deeper understanding of the data and its planned
use. NoSQL practically operates on a lower layer than SQL does.

Still, it's nice to see people thinking about data storage choices and not
going blindly to MySQL/Oracle/etc!

------
markoa
What we're missing are similar articles that go into disadvantages and
implications for deployment.

E.g. I have found out that deploying Tokyo Tyrant in a Rails project requires
you to write some scripts to ensure that things run properly. Also the db size
has to be set in configuration in advance.

MongoDB OTOH is not designed for a single server environment, has a very small
max document size, easily gets corrupted if process is stopped etc.

------
nl
CouchDB & MongoDB both share one property that this comparison misses (or
mentions only in passing).

Both are schema free datastores. For me, this is the biggest, most useful
difference between them and traditional SQL databases, because it makes things
easy that are very, very hard (or inefficient) on an SQL database.

It's probably also worth noting that other NoSQL solutions don't share this
advantage. For example, Cassandra requires all nodes to be restarted to apply
a schema change, which can be quite a big deal.

~~~
tylerhobbs
"Cassandra requires all nodes to be restarted to apply a schema change, which
can be quite a big deal."

That's no longer true. In 0.7, keyspaces and column families may be created,
altered, or dropped live.

~~~
nl
I thought it might have been fixed by now.

Anyway, you still need a schema with Cassandra.

~~~
rbranson
Not in the same sense as an SQL database. You can freely add columns and rows,
just not Column Families or Keyspaces. This is because a KS+CF combo is stored
in its own file, in a certain order, so that it can be efficiently traversed
using natural ordering. If you don't have this need and just need a flat K/V
database, you can use a single KS+CF for everything.
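
A rough sketch of that idea, assuming a toy in-memory model: the column
family is fixed up front, rows can take any columns, and columns are kept
sorted by name so a range of them can be read in order. The `ColumnFamily`
class and its methods are hypothetical and do not reflect Cassandra's API or
its on-disk format.

```python
import bisect


class ColumnFamily:
    """Toy column family: any row may hold any columns, kept sorted by name."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # row_key -> list of (column_name, value), sorted by name

    def insert(self, row_key, column, value):
        cols = self.rows.setdefault(row_key, [])
        names = [name for name, _ in cols]
        i = bisect.bisect_left(names, column)
        if i < len(cols) and cols[i][0] == column:
            cols[i] = (column, value)         # overwrite an existing column
        else:
            cols.insert(i, (column, value))   # new column, no schema change

    def slice(self, row_key, start, end):
        """Return columns with start <= name <= end, in natural order."""
        cols = self.rows.get(row_key, [])
        names = [name for name, _ in cols]
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_right(names, end)
        return cols[lo:hi]
```

Using date strings as column names, for example, turns `slice` into an
ordered range scan over one row.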

------
kkovacs
I think it's a nice closing word from @jzy:

A SQL query goes into a bar, walks up to two tables and asks, "Can I join
you?" "No, but you can enjoy the view."

Sorry :)

K.

------
schmichael
Under protocols you may want to specify MongoDB's as BSON and Cassandra's as
Thrift. That would be more helpful than "binary/custom".

Updated:

Also Redis's main selling point is its extensive data structure/operations
support. "Blazingly fast" really depends on what your workload is and what
you're comparing it against.

~~~
kkovacs
Protocols: great idea, thanks man, amended it!

Blazing fast: I mean compared to the other four.

~~~
schmichael
I've had both MongoDB and Cassandra perform nearly as well as Memcache when
getting a single document/row when the document was in memory in MongoDB and
the row was cached (row cache, not just key cache, so again: fully in memory)
in Cassandra.

In memory operations are fast in many databases. Redis's default configuration
(vm-enabled no) only does operations in memory (with an occasional sync
to disk). That's terrible durability but fantastic performance. Most
databases, including Redis, can be configured for either that sort of high-
performance/low-durability or the opposite. It's just that their default
settings/behaviors vary widely.

------
paxa
There is also VertexDB, a small graph database. It's written in C and uses
Tokyo Cabinet for storing data. It has a simple HTTP filesystem-like
interface. The main advantage is links, which let you build graph structures
at the database level.

<https://github.com/stevedekorte/vertexdb>

------
waratuman
You mention that some of these solutions could be used in the financial
industry. I would be cautious of using these, especially since some are
eventually consistent. If you are just tracking data these may be fine though.

~~~
kkovacs
I agree with that, but you know, it's very hard to find good examples. :) I'm
open to better suggestions :)

~~~
hendler
Some points on Cassandra:

- Facebook designed it for inbox feature
- SOLR/Lucene is being integrated
- recently Sequoia backed Riptano - see <http://www.riptano.com/>

~~~
mjphilli
But facebook recently dumped it for inbox feature and began using Hbase
instead. Not sure if Sequoia backing Riptano is supposed to be a bug or a
feature?

------
mikeytown2
I was hoping HandlerSocket would be in here. If you don't know about it, check
it out <http://news.ycombinator.com/item?id=1886137>

~~~
kkovacs
I don't know that one yet, but I promise I'll read up on it! Thanks!

K.

~~~
mikeytown2
The latest Percona flavor of MySQL has it built in
[http://www.mysqlperformanceblog.com/2010/12/14/percona-
serve...](http://www.mysqlperformanceblog.com/2010/12/14/percona-server-now-
both-sql-and-nosql/)

------
lukev
Interesting and useful.

One major feature differentiator is something it doesn't really talk about,
though - how conducive is each system to Massive Data?

For example, he kind of has a bone to pick with Cassandra, which is probably
justified. But from what little I know, one of the features of Cassandra is
that it's designed to scale pretty much to infinity. That may be true of a
couple of the others, but for some (like CouchDB) it isn't a design goal at
all.

~~~
kkovacs
Good point, and it's not there since I only wanted to speak from experience;
especially with rumors of Cassandra scaling problems at Reddit and Digg.

But sure thing, "infinite" scaling is probably best done with the Dynamo-like
stuff like Cassandra and/or Riak.

~~~
DennisP
Reddit runs Cassandra with just a few nodes. Cassandra scales up well, but
doesn't scale down as well:

[http://www.reddit.com/r/announcements/comments/c2spc/reddits...](http://www.reddit.com/r/announcements/comments/c2spc/reddits_may_2010_state_of_the_servers_report_or/c0ptl28)

------
fjabre
Membase?

Using it in a recent project and it's been working great for us.

~~~
mjphilli
Membase is the one to watch. I saw a presentation by AOL @ Hadoop World re:
their use of Membase. Incredibly impressive technology.

------
Juha
Thanks for the article, good information.

Does anyone have any user numbers for the different NoSQL databases? Or can
anyone just say which two are the most popular? I guess some of them will rise
above the others in the following years and some will drop. User numbers would
indicate which ones have the most potential to stay around and be accepted as
standard NoSQL databases.

------
redthrowaway
So if Cassandra writes are much faster than reads, why would Reddit go that
route? Their comment server is consistently breaking on them, and it would
seem that a sub-optimal choice of db might be partly to blame.

~~~
rbranson
It's not as lop-sided as this article might have you believe, and it has
largely been mitigated as of late. This is because Cassandra uses read repair,
which is a big component of its strategy to make both reads and writes scale
linearly while also ensuring durability.

What is your suggestion otherwise? Any distributed database that is going to
be inexpensive, performant, scalable, and durable will need to use some kind
of quorum read repair system. Riak, Voldemort, and Dynamo all use read repair
with high levels of production success.
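
A minimal sketch of read repair, assuming each replica is a plain dict that
maps a key to a `(timestamp, value)` pair and that the newest timestamp wins.
`quorum_read` is a hypothetical helper for illustration; real systems differ
in how they pick a winner and in when the repair write happens.

```python
def quorum_read(replicas, key, r):
    """Read `key` from the first `r` replicas and repair any stale copies.

    Each replica is a dict: key -> (timestamp, value).
    Returns the newest value seen, or None if no contacted replica has the key.
    """
    contacted = replicas[:r]
    answers = [replica.get(key) for replica in contacted]
    found = [answer for answer in answers if answer is not None]
    if not found:
        return None
    newest = max(found, key=lambda versioned: versioned[0])
    for replica, answer in zip(contacted, answers):
        if answer != newest:
            # Read repair: push the newest version to replicas that lag behind.
            replica[key] = newest
    return newest[1]
```

Only the replicas that took part in the read get repaired, so a replica that
was not contacted stays stale until a later read or write reaches it.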

------
luca_garulli
Good article, it's a good starting point to help people decide where to
start with a NoSQL solution. But what about OrientDB? Do you plan to add
it to this feature comparison?

------
ares2012
I'm curious why you wouldn't include HBase as it's the dominant solution for
NoSQL in systems requiring data consistency?

~~~
kkovacs
Only because I don't know HBase very well. I'll read up on it, at least try
it, and then add that too (and also maybe Tokyo Cabinet).

Thanks for the feedback!

~~~
muloka
Cool... and maybe also Voldemort?

------
laran
Nice writeup. A good survey of some key tools in the NoSQL space. Thanks!

