First of all, it's beautifully engineered, as long as you just need a KV store or a graph DB (I wasn't in love with the MapReduce stuff, but that's another story). None of the hassle that Hadoop/HBase have with some nodes being special (HBase Master, HDFS NameNode, etc). Also, no running multiple daemons per server (e.g., no distinction between regionserver and datanode daemons, like HBase). Easy config files, a simple HTTP API (so you can just throw an off-the-shelf load balancer like HAProxy in front of it), and lots of little things that just make life easier.
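For what it's worth, fronting the HTTP API with HAProxy takes almost no config. A minimal sketch (node addresses are made up; 8098 is Riak's default HTTP port, and /ping is its health-check endpoint):

```
frontend riak_http
    bind *:8098
    default_backend riak_nodes

backend riak_nodes
    balance roundrobin
    option httpchk GET /ping
    server riak1 10.0.0.1:8098 check
    server riak2 10.0.0.2:8098 check
    server riak3 10.0.0.3:8098 check
```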
I also really like how it's very explicit about the CAP tradeoffs it makes - with powerful conflict resolution available for when consistency has been sacrificed (instead of trying to sweep the issue under the rug, like many other distributed dbs do).
However, there are a few downsides.
First, as mentioned in the article, with the default backend (a kv-store called 'bitcask') all the keys per node have to fit in memory (and each key has on the order of 20 bits of overhead, IIRC). Annoyingly, this fact isn't noted anywhere in the Riak documentation that I saw (although there is a work-around: using the InnoDB backend). This won't matter too much for many use cases, but it can be pretty painful if you're not expecting it.
Second, you can guard against data loss by specifying that N copies of each piece of data are stored on your cluster. However, under some conditions (https://issues.basho.com/show_bug.cgi?id=228), the data may be replicated on fewer than N distinct nodes, so the failure of fewer than N nodes could result in data loss.
Finally, to the best of my knowledge, the largest Riak clusters in production are around 60 nodes, while Hadoop has 2000 node clusters running (e.g., FB) in production. Perfectly acceptable for most users, but just one more thing to worry about if you're planning to roll out a large cluster (more potential for 'unknown unknowns', so to speak).
Re. Key limitations when using the default bitcask backend:
There is a spreadsheet that outlines the number of keys you can have in your system based on a few variables; definitely worth checking out. Two things to note here: there is the hard limit of keys per machine, due to each machine's max memory, and there is the larger limit of keys per cluster, based on the max memory of the cluster. Note that all this applies specifically when using the bitcask backend. There has been lots of talk about how to change this going forward, and I know the Basho team is looking into it.

Riak is quite interesting in that it can support a number of different backends - at the same time, even. So you could have a bitcask backend for some data and a memory backend for other data. Since Riak is distributed, the implication is that beyond a single machine's resources you should also think of the total cluster's resources: particularly cluster total memory, total CPU cores and total disk spindles - that last one quite important but under-considered.
Re. max nodes in production:
One of the major considerations when dealing with a Dynamo-derived system like Riak is cross-node chatter. Riak does a lot of its magic by way of a gossip channel that is sending around all kinds of data. As the number of nodes in the system increases, the level of chatter increases. I think there needs to be more work on optimizing that chatter for larger cluster sizes, and I believe that has been happening if you follow the change logs.
If you follow the mailing list or IRC, you will notice that the primary concern amongst people new to Riak is querying. Riak has no secondary indexes, outside of Riak Search, which is a separate download built on top of Riak (see code modularization). Riak has no native ordering. All of those things need to happen in the map/reduce phase. As it stands, I think this is one of the major friction points to further adoption, and why I have consistently recommended pairing Riak with Redis whenever possible.
These limitations aside, Riak represents, IMHO, the best mix of distribution, ease of use, raw power and growth potential among the NOSQL persisted datastores currently in production. Philosophically I am very much attuned to the Dynamo worldview, and as a strict adherent I think Riak is the best representative from that perspective. Take that for what you will.
Disclaimer - Riak and Redis are my nosql databases of choice.
20 bits? Really? Less than an integer? Or did you mean bytes? (Not nitpicking here, I'm just curious)
IIRC, a 64-node cluster with 24 GB of RAM per node will handle a few billion 32-byte keys, replicated to three nodes. For larger keyspaces, the current recommendation is to use InnoDB, which doesn't need to keep keys in memory.
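As a sanity check on those numbers, here's a back-of-envelope sketch. The ~40-byte keydir overhead per entry is an assumption on my part (the real figure is in Basho's capacity spreadsheet and may differ):

```python
# Back-of-envelope check of the "few billion keys" claim above.
# ASSUMPTION: ~40 bytes of in-memory keydir overhead per entry.
NODES = 64
RAM_PER_NODE = 24 * 2**30   # 24 GB in bytes
KEY_SIZE = 32               # bytes per key
OVERHEAD = 40               # assumed keydir overhead per entry
N = 3                       # replicas, so each key costs N keydir entries

cluster_ram = NODES * RAM_PER_NODE
entry_cost = KEY_SIZE + OVERHEAD
unique_keys = cluster_ram // (entry_cost * N)
print(f"{unique_keys / 1e9:.1f} billion keys")  # roughly 7-8 billion
```

Which lands right in "a few billion", assuming all of RAM goes to the keydir (it won't, so treat it as an upper bound).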
This is a question I've gotten conflicting answers to.
Without doing a full-on sales pitch (I am not a salesman and do not play one on TV), I should say that Bigcouch adds a lot of desired features to CouchDB, notably scaling beyond a single machine and sharding (which supports very large databases). We have run several clusters of Bigcouch nodes in production for some time, and it's a thrill for me personally to see a distributed database take a real pounding and keep on trucking.
I've been meaning to try Riak myself, so you've inspired me to finally pull down the code and give it a proper examination.
That said, vanilla CouchDB is quite stable. While there are always bugs, I don't recognize the system that the original poster is describing.
Couch files are written in an append-only fashion, so all operations, including updates, are appends. The main upside is durability, meaning the file is always readable and never left in an odd state. As noted, however, this has the downside of requiring compaction to reclaim disk space. You should note that if you are using bitcask, the default backend for Riak, you will have the same problem that you had with Couch, as bitcask is also an append-only log. Bitcask will also need compaction, but as of the latest release there are various knobs to tweak regarding when that is queued up and when it is executed.
Re. database being one file:
I'm not exactly sure why Couch uses one file, but I suspect it is due to the use of b-trees internally. Would love to hear more from someone more experienced in Couch. Riak splits its data amongst many files: specifically, one file (or 'cask', when using the default bitcask backend) per 'vnode' (virtual node; 64 by default). The drawback here is that you need to ensure that your environment has enough open file descriptors to service requests.
In fact, it's one of CouchDB's most clever tricks to store a b-tree in an append-only fashion (it's far more common to update in place).
Two reasons CouchDB is strictly append only.
1) Safety: By never overwriting data, it avoids a whole class of data corruption issues that plague update-in-place schemes.
2) Performance: Writing at the end of the file allocates new blocks, leading to lower fragmentation and less seeking than an update-in-place scheme.
"durability" does not mean "the file is...never left in an odd state". The right word there is "consistent". CouchDB provides both (In fact it provides all four ACID properties, but the scope of a transaction is constrained to a single document update).
Finally, I should point out that using a single file, as opposed to a series of append-only files (like Berkeley JE, for example) is just a pragmatic choice. Nothing in principle would prevent a JE style approach, it's just a lot of (quite hard) work to do well.
CouchDB has a bit of an alternative design in that it accepts that people might be running a large number of databases on a single node. I don't remember the exact numbers but I think we've heard of deploys using 10-100K (small) db's on a single node.
As to complexity, with a single file there's no magical fsync dance to coordinate when committing data to disk. It's not impossible, it's just more complex.
It's not out of the question that CouchDB will move to using multiple files per database, but as it's open source, the biggest roadblock so far is that no one's needed it badly enough to implement it.
Given that a pair of 2TB drives is less than $250 on ebuyer right now, 2TB of data is not 'big data'. You could comfortably stuff that in any decent database (SQL Server for example, I'm sure PostgreSQL would work too).
Just because a tiny machine on slicehost isn't big enough doesn't mean that your data won't fit in a normal database.
You may have noticed that the OP's requirements were:
* a REST interface
The costs of SQL Server are unbelievably high, and while you can do horizontal partitioning and sharding with it, Riak is designed to be used in this manner. On the other hand, re: SQL Server, according to this article the "horizontal partitioning feature requires Enterprise or Datacenter licenses which have a retail price of $27,495 and $54,990 per processor." I think the OP made a wise choice. http://www.infoq.com/news/2011/02/SQL-Sharding
Apparently really high-end clustered databases solve this problem correctly, i.e., the schema is extended to specify where any record can be found, the cluster uses that to work out a minimally-stupid query plan, and record->shard mapping becomes merely a tuning decision. I've never had the opportunity to work with one. But I'm not bitter.
It handles 'data distribution, replication, persistence, addressing and access' quite well.
This hardware configuration doesn't make sense. 8 gigabytes of memory is pathetically small today. 32 gigs is basically standard, with 64 gigs costing little extra. RAID 5 with only three disks will have horrible performance. A single machine with 32/64 gigs and 12-16 disks in a better RAID configuration should perform better at a lower operational cost.
Now you can simply increase complexity and create multiple Riak clusters for different data schemes and treat Riak as a replication tool. However, their in-memory bitcask map-reduce was actually slower than a hard-drive map-reduce over text JSON files in a folder, where each file was individually opened, read from disk, decoded, and closed in a loop (with nothing held in memory). Which was rather scary!
Their replication scheme appears to be a pretty genuine copy of Amazon Dynamo though (consistent hashing ring, quorum, merkle trees) which is nice.
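For readers unfamiliar with the consistent-hashing part of that design, here's a toy sketch. Node names and the vnode count are made up, and real Riak uses fixed-size partitions on the ring rather than random points:

```python
import hashlib
from bisect import bisect

def _h(s):
    """Hash a string onto the ring as a large integer."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hashing ring: each node owns several points
    ("vnodes"); a key lives on the first node clockwise from its hash."""
    def __init__(self, nodes, vnodes=8):
        self.points = sorted(
            (_h(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes)
        )

    def owner(self, key):
        hashes = [p for p, _ in self.points]
        i = bisect(hashes, _h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["riak1", "riak2", "riak3"])
print(ring.owner("user:42"))  # one of riak1/riak2/riak3
```

The point of the vnodes is that adding or removing a machine only moves the keys adjacent to its points, instead of reshuffling everything.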
>keeps values on disk in log-structured files
Of course it does. Log files are the easiest way to make an in-memory data structure persistent. Logs are generally used for periodically dumping fast writes as a backup, in case you want to reload the in-memory data structure in the future. Read requests are served from the in-memory data structure and do not require a disk access.
The fact that something using all your RAM, which should require 0 disk accesses for reads, is NOT substantially beating something performing thousands of disk accesses is a FUBAR situation.
Begging your pardon, but I think you may be misunderstanding bitcask. The bitcask keydir is stored in memory, but the values are stored on disk. The keydir is a hash mapping each key to a file ID and the offset/size in that file at which the value is stored. The only time values are stored in memory is when the kernel's fs cache or readahead buffer provides them.
Since a filesystem directory listing is likely held by your OS cache, you should see similar performance between bitcask and files on disk: an in-memory lookup to obtain the inode/offset, and a disk seek+read.
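The read path described above can be sketched in a few lines of Python. This is a toy single-file version with no CRCs, timestamps, hint files, or compaction:

```python
import os
import tempfile

class TinyCask:
    """Toy bitcask-style store: values are appended to a log file on
    disk; the in-memory keydir maps key -> (offset, size); a read is
    one seek plus one read."""
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.keydir = {}  # key -> (offset, size)

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(value)
        self.f.flush()
        self.keydir[key] = (offset, len(value))

    def get(self, key):
        offset, size = self.keydir[key]
        self.f.seek(offset)
        return self.f.read(size)

path = os.path.join(tempfile.mkdtemp(), "data.cask")
db = TinyCask(path)
db.put("a", b"hello")
db.put("a", b"world")  # old value stays on disk until compaction
print(db.get("a"))     # b'world'
```

Note how an overwrite just appends and repoints the keydir entry; the stale bytes are what compaction exists to reclaim.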
Riak will be substantially slower than using bitcask directly, however, because you may need to talk over the network to as many as N nodes, wait for all their responses, compute the resultant vclock/metadata, and then serialize it for HTTP (huge protocol overhead) or protocol buffers (relatively fast). If you're running on a single machine, then you may incur additional time for that machine to run what would normally be distributed over three nodes. Without knowing more about your benchmark, however, it's difficult to say.
Yes I was, I apologize. It must have been MongoDB that required total data be no larger than amount of RAM.
The test was run using protocol buffers and a Python client on the same computer, testing reads against a naive map-reduce. The naive map-reduce stored 4,000 files in a folder, treating the filename as the key and parsing each text file's contents from JSON into a Python dictionary to see if an attribute matched. Basically, I figure a loop of thousands of blocking disk accesses from a laptop hard drive on a standard filesystem, buffering nothing in memory, should always be much slower than any database.
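For concreteness, the naive side of that benchmark is roughly this (directory, filenames, and the attribute are made up, not the actual test data):

```python
import json
import os
import tempfile

def naive_select(dirpath, attr, wanted):
    """Scan a directory of JSON files, re-opening and re-parsing each
    one per query; return the filenames (keys) whose attr matches."""
    matches = []
    for name in os.listdir(dirpath):
        with open(os.path.join(dirpath, name)) as f:
            doc = json.load(f)
        if doc.get(attr) == wanted:
            matches.append(name)
    return matches

# Throwaway example data:
d = tempfile.mkdtemp()
for i in range(4):
    with open(os.path.join(d, f"k{i}.json"), "w") as f:
        json.dump({"color": "red" if i < 2 else "blue"}, f)

print(sorted(naive_select(d, "color", "red")))  # ['k0.json', 'k1.json']
```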
This was last year so maybe Riak's performance has increased since then. I'd be interested if TokyoCabinet was added as a backend.
You'll see significantly improved performance on a linear test (in my informal testing, 3-4x speedups) by adding an extra two nodes. Parallelized tests pretty much scale linearly with nodes.
In practice, I've found Riak to be slightly slower than MySQL. Direct reads/writes tend to be fast, but JSON parsing can bite you and denormalization requires more writes. The major advantage is that the Riak system can scale linearly with nodes, and that it can fail in predictable and resolvable ways.
As an example, the feed system I'm currently building on Riak will survive a total network partition and allow full reads and writes from every node with no data lost. Everything is automatically merged when the partition ends. The vclock-tagged multi-value functionality of Riak is exceptionally powerful when you want to design these types of systems, and is, in my mind, worth the performance hit and additional design complexity for certain classes of problems.
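A toy sketch of the version-vector logic that makes that possible; node names and the clock layout are simplifications of Riak's actual vclocks:

```python
def descends(a, b):
    """True if clock a has seen every event clock b has seen."""
    return all(a.get(node, 0) >= n for node, n in b.items())

def resolve(v1, clock1, v2, clock2):
    """Return the surviving value(s): one clock dominating means a
    clean winner; otherwise the writes were concurrent and both are
    kept as siblings for the application to merge."""
    if descends(clock1, clock2):
        return [v1]
    if descends(clock2, clock1):
        return [v2]
    return [v1, v2]

# Two writes made on opposite sides of a partition:
print(resolve("follow:bob", {"n1": 2},
              "follow:eve", {"n2": 1}))  # ['follow:bob', 'follow:eve']
```

In the feed example, the application-level merge of those siblings is just a set union of follow operations, which is why nothing is lost.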
> This was last year so maybe Riak's performance has increased since then. I'd be interested if TokyoCabinet was added as a backend.
There are also InnoDB and multiple in-memory backends, which may provide performance characteristics more in line with what you are looking for.
Not quite. The keys are in memory, along with the disk position needed to retrieve the associated value with one seek.
I had the same exact experience during my master's thesis (hn: http://news.ycombinator.com/item?id=2022192 ) and this was one of the reasons why CouchDB didn't seem a good solution for a dataset that has a high number of updates.
For anyone who cares: we had 6x as many documents (as linkfluence) in a single CouchDB instance before we moved to Cloudant. We now have over 360 million documents on Cloudant.
Our database is very write-intensive: lots of new documents, few updates to the existing documents, and a lot of reads of the documents in between. We also have a couple of larger views to aggregate the data for our users.
The ride with CouchDB has been bumpy at times, but not at the 'scale' where linkfluence is at.
Note that there's a `_revs_limit` setting available: <http://wiki.apache.org/couchdb/HTTP_database_API>. It's very beneficial for a use case like yours.
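For reference, `_revs_limit` is a per-database setting, read and written over the same HTTP API (database name and value here are made up):

```
GET /mydb/_revs_limit HTTP/1.1

PUT /mydb/_revs_limit HTTP/1.1
Content-Type: application/json

100
```

Lowering it means CouchDB tracks less revision history per document, which keeps heavily-updated databases from bloating as fast between compactions.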
(Though I've seen CouchDB performing rather poorly when taking really heavy read/write load or compacting big datasets, on occasion.)