On moving from CouchDB to Riak (linkfluence.net)
152 points by franckcuny 2511 days ago | 41 comments

I went through a lengthy evaluation process of Riak recently, and came away with a generally positive impression.

First of all, it's beautifully engineered, as long as you just need a KV store or a graph DB (I wasn't in love with the MapReduce stuff, but that's another story). None of the hassle that Hadoop/Hbase have about some nodes being special (HBase Master, HDFS Namenode, etc). Also, no running multiple daemons per server (e.g., no distinction between regionserver and datanode daemons, like HBase). Easy config files, simple HTTP API (so you can just throw an off the shelf load balancer like HAProxy in front of it), and lots of little things that just make life easier.

I also really like how it's very explicit about the CAP tradeoffs it makes - with powerful conflict resolution available for when consistency has been sacrificed (instead of trying to sweep the issue under the rug, like many other distributed dbs do).

However, there are a few downsides.

First, as mentioned in the article, with the default backend (a kv-store called 'bitcask') all the keys per node have to fit in memory (and each key has, on the order of, 20 bits of overhead, IIRC). Annoyingly, this fact isn't noted anywhere in the Riak documentation that I saw (although there is a work-around: using the InnoDB backend). This won't matter too much for many use cases, but it can be pretty painful if you're not expecting it.

Second, you can guard against data loss by specifying that N copies of each piece of data are stored on your cluster. However, under some conditions (https://issues.basho.com/show_bug.cgi?id=228), the data may be replicated on fewer than N distinct nodes. So, the failure of fewer than N nodes could result in data loss.
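
A minimal sketch of one way replicas can land on fewer than N distinct nodes. It assumes a ring of partitions handed out round-robin, with each key stored on the owners of N consecutive partitions. The function names are made up, and this is not necessarily the exact condition described in the linked bug.

```python
def assign_round_robin(num_partitions, nodes):
    """Hand out ring partitions to nodes in simple round-robin order."""
    return [nodes[i % len(nodes)] for i in range(num_partitions)]

def preference_list(ring, start, n):
    """Owners of the n consecutive partitions beginning at `start` (wraps around)."""
    return [ring[(start + i) % len(ring)] for i in range(n)]

ring = assign_round_robin(8, ["a", "b", "c"])   # a b c a b c a b
N = 3
# Partitions whose N replicas do not land on N distinct nodes.
under_replicated = [
    start for start in range(len(ring))
    if len(set(preference_list(ring, start, N))) < N
]
print(under_replicated)  # prints [6, 7]
```

With 8 partitions on 3 nodes, keys that hash to the last two partitions wrap around the ring and get two of their three copies on the same node, so losing two nodes can lose all copies.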

Finally, to the best of my knowledge, the largest Riak clusters in production are around 60 nodes, while Hadoop has 2000 node clusters running (e.g., FB) in production. Perfectly acceptable for most users, but just one more thing to worry about if you're planning to roll out a large cluster (more potential for 'unknown unknowns', so to speak).

I've given a few talks about NOSQL in general and Riak in particular and blogged[0] about both. I agree with all of your comments and will try to add a bit. One of the major driving design considerations of Riak is that of predictable scalability. Meaning that for each unit of hardware that you add to the system you are returned a predictable level of performance. You see this throughout the design of the system. Homogeneity amongst the nodes - meaning that all nodes are the same and there are no special nodes - is a big win. So is tunable CAP, as you already mentioned. Another major win from my perspective is the modular nature of the code base. I think it is one of the best layouts of a large system that I have seen yet. If you look at the main Riak repo you will see that it is comprised of a few sub modules. Basically most of the system is compartmentalized in this fashion. Major advantage is that development of components can proceed at their own pace and components can be reused/mixed more easily.

Re. Key limitations when using the default bitcask backend:

There is a spreadsheet[1] that outlines the number of keys you can have in your system based on a few variables. Definitely worth checking out. Two things to note here. There is the hard limit of keys per machine - due to each machine's max memory - and there is the larger limit of keys per cluster based on the max memory of the cluster. Note that all this applies specifically when using the bitcask backend. There has been lots of talk about how to change this going forward and I know the Basho team is looking into it. Riak is quite interesting in that it can support a number of different backends - at the same time even. So you could have a bitcask backend for some data and a memory backend for other data. Since Riak is distributed, the implication is that beyond thinking about a single machine's resources you should also think of the cluster's total resources. Particularly, cluster total memory, total cpu cores and total disk spindles, that last one quite important but under-considered.

Re. max nodes in production:

One of the major considerations when dealing with a Dynamo derived system like Riak is cross node chatter. Riak does a lot of its magic by way of a gossip channel that is sending around all kinds of data. As the number of nodes in the system increases the level of chatter increases. I think there needs to be more work in optimizing that chatter for larger cluster sizes and I think that has been happening if you follow the change logs.

If you follow the mailing list or irc you will notice that the primary concern amongst people new to Riak is that of querying. Riak has no secondary indexes, outside of Riak search, which is a separate download that is built on top of Riak (see code modularization). Riak has no native ordering. All of those things need to happen in the m/r phase. As it stands I think this is one of the major friction points to further adoption and why I have consistently recommended pairing Riak with Redis whenever possible.

These limitations aside, Riak represents, IMHO, the best mix of distribution, ease of use, raw power and growth potential of the currently in production NOSQL persisted datastore offerings. Philosophically I am very much attuned to the Dynamo world view and as a strict adherent think that Riak is the best representative from that perspective. Take that for what you will.

Disclaimer - Riak and Redis are my nosql databases of choice.



> (and each key has, on the order of, 20 bits of overhead, IIRC)

20 bits? Really? Less than an integer? Or did you mean bytes? (Not nitpicking here, I'm just curious)

I believe the limit using the current bitcask backend is (number of keys) * (40 bytes + average key size) * (replication factor) / (number of cluster nodes * memory capacity of the smallest node). If that factor grows above 1, you can't store any more.

IIRC a 64-node cluster with 24 GB of RAM per node will handle a few billion 32-byte keys, replicated to three nodes. For larger keyspaces, the current recommendation is to use innodb, which doesn't need to keep keys in memory.
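
As a rough check of that estimate, here is a sketch of the arithmetic. It assumes the ~40 bytes of per-key overhead quoted above, that a GB is 2^30 bytes, and that all of each node's memory is available for keys (in practice it will not be). The function name is made up for illustration.

```python
def max_bitcask_keys(nodes, ram_bytes_per_node, avg_key_bytes,
                     replicas=3, overhead_bytes=40):
    """Upper bound on distinct keys whose in-memory entries fit in cluster RAM."""
    # Every key is held in memory once per replica, plus fixed overhead.
    bytes_per_key = (overhead_bytes + avg_key_bytes) * replicas
    return (nodes * ram_bytes_per_node) // bytes_per_key

GIB = 2 ** 30
limit = max_bitcask_keys(nodes=64, ram_bytes_per_node=24 * GIB, avg_key_bytes=32)
print(f"{limit / 1e9:.1f} billion keys")
```

Under those assumptions the bound comes out between 7 and 8 billion keys, which agrees with "a few billion".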

Given that Riak seems to implement buckets "for free" by essentially making them key prefixes under-the-hood, does the bucket name size need to be considered as part of the key size? e.g., if my bucket name is 32 chars and my keys in that bucket are 32 chars, should I be using 64 bytes for average key size?

This is a question I've gotten conflicting answers to.

I think 20 bits is too small. I've been looking for a source without success; the closest I've found[0] is 32-75 bytes per key which is something like 11 million keys per GB of RAM.

[0] http://www.quora.com/Matt-Heitzenroder

20 bits can support a million records. As an average, per node or cluster, it seems reasonable to me.

Only a million records? That's waaaay too small.

Describing Bigcouch as a bit of hack while admitting you've not used it seems a little unfair. That said, it's very valuable to hear how other people see our product.

Without doing a full-on sales pitch (I am not a salesman and do not portray one on TV), I should say that Bigcouch adds a lot of desired features to CouchDB, notably scaling beyond a single machine and sharding (which supports very large databases). We run several clusters of Bigcouch nodes in production and have done for some time. It's a thrill for me personally to see a distributed database take a real pounding and keep on trucking.

I've been meaning to try Riak myself, so you've inspired me to finally pull down the code and give it a proper examination.

My reading was that they found CouchDB to have problems and adding complexity on top of a problematic base wasn't an appealing proposition for them.

My counterpoint is that Cloudant test and harden the CouchDB that we embed; it's not, for various reasons, a verbatim copy of CouchDB itself.

That said, vanilla CouchDB is quite stable. While there are always bugs, I don't recognize the system that the original poster is describing.

What kind of data do you record (number of documents, average size of document)? We have to use couchdb at my current job, and I have encountered all those issues even at moderate sizes (replication failing more often than not @ ~30-40 GB). I wonder if that's an environment issue.

The current CouchDB replicator is notoriously inadequate for the larger datasets or transfers across datacenters. A complete rewrite has been completed and will be merged in CouchDB & BigCouch soon.

Re. File size growth in Couch:

Couch files are written in an append only fashion, so all operations, including updates, are appends. The main upside is durability, meaning the file is always readable and never left in an odd state. As noted, however, this has the downside of requiring compaction to reclaim disk space. You should note that if you are using bitcask, the default backend for Riak, you will have the same problem that you had with couch, as bitcask is also an aol (append only log) (aka. wol - write only log). Bitcask will also need compaction, but as of the latest release there are various knobs to tweak regarding when that is queued up and when it is executed.

Re. database being one file:

I'm not exactly sure why couch uses one file but I suspect it is due to the use of b-trees internally. Would love to hear more from someone more experienced in couch. Riak splits its data amongst many files. Specifically one file (or 'cask' when using the default bitcask backend) per 'vnode' (virtual node, 64 by default). The drawback here is that you need to ensure that your environment has enough open file descriptors to service requests.

There are several reasons for CouchDB's use of a single, strictly append-only file, but the b-tree is not one of them.

In fact, it's one of CouchDB's most clever tricks to store a b-tree in an append-only fashion (it's far more common to update in place).

Two reasons CouchDB is strictly append only.

1) Safety: By never overwriting data, it avoids a whole class of data corruption issues that plague update-in-place schemes.

2) Performance: Writing at the end of the file allocates new blocks, leading to lower fragmentation and less seeking than an update-in-place scheme.

"durability" does not mean "the file is...never left in an odd state". The right word there is "consistent". CouchDB provides both (In fact it provides all four ACID properties, but the scope of a transaction is constrained to a single document update).

Finally, I should point out that using a single file, as opposed to a series of append-only files (like Berkeley JE, for example) is just a pragmatic choice. Nothing in principle would prevent a JE style approach, it's just a lot of (quite hard) work to do well.

The main two reasons that CouchDB doesn't use multiple files per database are system limits and increased complexity.

CouchDB has a bit of an alternative design in that it accepts that people might be running a large number of databases on a single node. I don't remember the exact numbers but I think we've heard of deploys using 10-100K (small) db's on a single node.

As to complexity, with a single file, there's no magical fsync dance to coordinate when committing data to disk. It's not unpossible, it's just more complex.

It's not out of the question that CouchDB will move to using multiple files per database, but as it's open source, the biggest road block so far is that no one's needed it badly enough to implement it.

I've noticed that interesting things tend to show up in bigcouch before couchdb. So, if I were starting a couchdb based project today, it would be with cloudant.

"We store a lot of data... 2TB"

Given that a pair of 2TB drives is less than $250 on ebuyer right now, 2TB of data is not 'big data'. You could comfortably stuff that in any decent database (SQL Server for example, I'm sure PostgreSQL would work too).

Just because a tiny machine on slicehost isn't big enough doesn't mean that your data won't fit in a normal database.

> You could comfortably stuff that in any decent database (SQL Server for example, I'm sure PostgreSQL would work too).

You may have noticed that the OP's requirements were:

* a REST interface
* sharding

The costs of SQL Server are unbelievably high, and while you can do horizontal partitioning and sharding, Riak is designed to be used in this manner. On the other hand, re: SQL Server, according to this article the "horizontal partitioning feature requires Enterprise or Datacenter licenses which have a retail price of 27,495 and 54,990 per processor." I think the OP made a wise choice. http://www.infoq.com/news/2011/02/SQL-Sharding

You don't necessarily need to shard if the data fits comfortably on a standard disk... and the parent did mention PostgreSQL.

Sharding (for free) on PostgreSQL is surprisingly painful

I thought people just built sharding into the application level when using MySQL stuff - I'd imagine the same thing works just fine for PostgreSQL. Or were you talking about something else?

Anyone who volunteers for app level sharding is going to find themselves deep in Special Hell. Step one is to rewrite all your queries so they no longer expect really exotic use cases like "SELECT ... WHERE" to actually work until you somehow figure out which nodes to run them on (and if it was supposed to be a join between entities crossing shards, good luck with that). Step two is to find working XA-aware drivers (haha). For step three you need a knife and a goat....
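
To make step one concrete, here is a toy sketch of application-level sharding, with in-memory dicts standing in for database nodes. The class and function names are made up. A lookup by key goes to one shard, while any other filter has to visit every shard and merge the results itself.

```python
import hashlib

def shard_for(key, num_shards):
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

class ShardedTable:
    def __init__(self, num_shards):
        # Each dict stands in for a separate database node.
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, row):
        self.shards[shard_for(key, len(self.shards))][key] = row

    def get(self, key):
        # Lookup by key touches exactly one shard.
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def where(self, predicate):
        # Any other filter is a scatter-gather over every shard.
        return [row for shard in self.shards
                for row in shard.values() if predicate(row)]

users = ShardedTable(4)
users.put("alice", {"name": "alice", "city": "Paris"})
users.put("bob", {"name": "bob", "city": "Lyon"})
users.put("carol", {"name": "carol", "city": "Paris"})
parisians = sorted(row["name"] for row in users.where(lambda r: r["city"] == "Paris"))
```

Joins across shards and multi-shard transactions are not covered here, and those are where the real pain starts.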

Apparently really high-end clustered databases solve this problem correctly, i.e., the schema is extended to specify where any record can be found, the cluster uses that to work out a minimally-stupid query plan, and record->shard mapping becomes merely a tuning decision. I've never had the opportunity to work with one. But I'm not bitter.

Actually, if you use MySQL as a "great hashtable in the sky (cloud)", it's not all that much more painful to shard things. That said, if you do that you're already in a state of sin/pain...

'Sharding' and 'a rest interface' are not really business requirements. Sharding is only a solution to handling certain types of high load, and REST is but one solution for IPC.

Just because your data will fit in a normal database doesn't mean it will be fast enough.

Anyone planning to store significant quantities of data across multiple distinct nodes should have a look at the Disco Distributed File System:


It handles 'data distribution, replication, persistence, addressing and access' quite well.

> We rolled out a cluster of 5 nodes and 1024 partitions. Each node has 8gig of memories, 3x1Tb disk in RAID5.

This hardware configuration doesn't make sense. 8 gigabytes of memory is pathetically small today. 32 gigs is basically standard, with 64 gigs costing little extra. RAID 5 with only three disks will have horrible performance. A single machine with 32/64 gigs and 12-16 disks in a better RAID configuration should perform better at a lower operational cost.

Don’t forget what node separation gets you in terms of availability and partition-tolerance. But I do agree that the 8gig, RAID5 and 3-disk specs seem odd. It'd be useful to get some additional details from Franck or the Basho folks as to why they chose those.

What I disliked about Riak is that although at first glance it appears you can namespace key/values into multiple buckets, you can't. The whole database is really only a single bucket and if you want to run a map/reduce it's always over every key in the database!

Now you can simply increase complexity and create multiple Riak clusters for different data schemes and treat it as a replication tool. However, their in memory bitcask map-reduce was actually slower than a hard-drive map-reduce on text JSON files in a folder, where each file was individually opened, read from disk, encoded, and closed in a loop (with nothing held in memory). Which was rather scary!

Their replication scheme appears to be a pretty genuine copy of Amazon Dynamo though (consistent hashing ring, quorum, merkle trees) which is nice.

To my knowledge, there is no such thing as an in-memory bitcask backend for Riak. Could you have been using regular Bitcask (keeps values on disk in log-structured files) or the ETS memory backend?

Bitcask is an in memory datastore because it keeps a copy of the entire dataset in memory. 4GB of data==4GB of memory>trivial amount of memory used by a file descriptor.

>keeps values on disk in log-structured files

Of course it does. Log files are the easiest way to make an in memory data structure persistent. Logs are generally used for periodically dropping fast writes as a backup in case you want to reload the in-memory data structure in the future. Read requests are structured and served from the in memory data structure and do not require a disk access.

The fact that you have something using all your RAM that should require 0 disk accesses for reads NOT substantially beating something performing thousands of disk accesses is a FUBAR situation.

> Bitcask is an in memory datastore because it keeps a copy of the entire dataset in memory.

Begging your pardon, but I think you may be misunderstanding bitcask. The bitcask keydir is stored in memory, but the values are stored on disk. The keydir is a hash mapping each key to a file ID and the offset/size in that file at which the value is stored. The only time values are stored in memory is when the kernel's fs cache or readahead buffer provides them.
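
A toy sketch of that layout, heavily simplified: it uses a single log file instead of file IDs and stores only raw values, with none of the headers, checksums or compaction a real store needs. The class name is made up. Only the key directory lives in memory, and every read is one in-memory lookup plus one seek and read on disk.

```python
import os
import tempfile

class TinyCask:
    """Toy append-only store: values on disk, key directory in memory."""

    def __init__(self, path):
        self.path = path
        self.keydir = {}          # key -> (offset, size) of the latest value
        open(path, "ab").close()  # make sure the log file exists

    def put(self, key, value):
        with open(self.path, "ab") as log:
            offset = log.seek(0, os.SEEK_END)  # always append at the end
            log.write(value)
        self.keydir[key] = (offset, len(value))

    def get(self, key):
        offset, size = self.keydir[key]        # in-memory lookup
        with open(self.path, "rb") as log:     # one seek + one read on disk
            log.seek(offset)
            return log.read(size)

with tempfile.TemporaryDirectory() as d:
    cask = TinyCask(os.path.join(d, "data.log"))
    cask.put("k", b"old")
    cask.put("k", b"new")
    value = cask.get("k")
    log_size = os.path.getsize(cask.path)
```

Here `value` is the latest write, while the log still holds both values until something compacts it, which is the same disk-growth behaviour discussed for append-only files earlier in the thread.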


Since a filesystem directory listing is likely held by your OS cache, you should see similar performance between bitcask and files on disk: an in-memory lookup to obtain the inode/offset, and a disk seek+read.

Riak will be substantially slower than directly using bitcask, however, because you may need to talk over the network to as many as N nodes, wait for all their responses, and compute the resultant vclock/metadata, and then serialize it for HTTP (huge TCP overhead) or protocol buffers (relatively fast). If you're running on a single machine, then you may incur additional time for that machine to run what would normally be distributed over three nodes. Without knowing more about your benchmark, however, it's difficult to say.

>you may be misunderstanding bitcask

Yes I was, I apologize. It must have been MongoDB that required total data be no larger than amount of RAM.

The test was run using protocol buffers and the Python client on the same computer, testing reads vs. a naive map reduce. The naive map reduce was storing 4,000 files in a folder, treating the filename as the key and parsing the text file contents from JSON into a python dictionary to see if an attribute matches. Basically I figure doing a loop of thousands of blocking disk accesses from a laptop hard drive on a standard filesystem, buffering nothing in memory, should always be much slower than any database.
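
For reference, a naive folder scan like the one described might look roughly like this. This is a reconstruction from the description, not the original benchmark code, and the attribute name `lang` and the file names are made up.

```python
import json
import tempfile
from pathlib import Path

def scan_folder(folder, attribute, wanted):
    """Return keys (file names) whose JSON document has attribute == wanted."""
    matches = []
    for path in sorted(Path(folder).iterdir()):
        with path.open() as f:      # one blocking open/read/close per document
            doc = json.load(f)      # parse JSON text into a dict
        if doc.get(attribute) == wanted:
            matches.append(path.name)
    return matches

with tempfile.TemporaryDirectory() as folder:
    # Write a handful of small JSON documents, one per file.
    for i in range(10):
        doc = {"lang": "fr" if i % 2 else "en"}
        Path(folder, f"doc{i}").write_text(json.dumps(doc))
    hits = scan_folder(folder, "lang", "fr")
```

Note that on a warm OS page cache most of these reads never reach the disk, which may explain part of why the folder scan looked so fast.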

This was last year so maybe Riak's performance has increased since then. I'd be interested if TokyoCabinet was added as a backend.

Riak, by default, uses a replication value of three. Your single test machine has to do ~three times the work, so you should expect slower performance here. (I'm oversimplifying somewhat.)

You'll see significantly improved performance on a linear test (in my informal testing, 3-4x speedups) by adding an extra two nodes. Parallelized tests pretty much scale linearly with nodes.

In practice, I've found Riak to be slightly slower than MySQL. Direct reads/writes tend to be fast, but JSON parsing can bite you and denormalization requires more writes. The major advantage is that the Riak system can scale linearly with nodes, and that it can fail in predictable and resolvable ways.

As an example, the feed system I'm currently building on Riak will survive a total network partition and allow full reads and writes from every node with no data lost. Everything is automatically merged when the partition ends. The vclock-tagged multi-value functionality of Riak is exceptionally powerful when you want to design these types of systems, and is, in my mind, worth the performance hit and additional design complexity for certain classes of problems.
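
A minimal sketch of the vector clock comparison behind that kind of merge, using plain dicts of node name to counter. This is not Riak's actual vclock encoding or client API, and the node names are made up. Two writes conflict when neither clock has seen everything the other has.

```python
def descends(a, b):
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    """Neither clock descends from the other: the writes conflict."""
    return not descends(a, b) and not descends(b, a)

def merge(a, b):
    """Clock for a value that resolves both siblings."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in a.keys() | b.keys()}

left = {"n1": 2}             # written on one side of the partition
right = {"n1": 1, "n2": 1}   # written on the other side
is_conflict = concurrent(left, right)   # both values are kept as siblings
resolved = merge(left, right)           # clock for the merged value
```

How the two sibling values themselves are combined (for a feed, probably a set union) is up to the application.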

> This was last year so maybe Riak's performance has increased since then. I'd be interested if TokyoCabinet was added as a backend.

There are also InnoDB and multiple in-memory backends, which may provide performance characteristics more in line with what you are looking for.

> Bitcask is an in memory datastore because it keeps a copy of the entire dataset in memo

Not quite. The keys are in memory along with disk position to retrieve the associated value with one seek.

It's great to see that I'm not the only person having problems with CouchDB's compaction failing when dealing with larger datasets.

I had the same exact experience during my master's thesis (hn: http://news.ycombinator.com/item?id=2022192 ) and this was one of the reasons why CouchDB didn't seem a good solution for a dataset that has a high number of updates.

I'm a little let down by the detail of this blog post.

For anyone who cares: we've had 6x as many documents (as linkfluence) in a single CouchDB instance before we moved to Cloudant. We now have over 360 million documents on Cloudant.

Our database is very write intensive: lots of new documents, few updates to the actual documents and a lot of reads of the documents in between. We also have a couple of larger views to aggregate the data for our users.

The ride with CouchDB has been bumpy at times, but not at the 'scale' where linkfluence is at.

> Each modification generates a new version of the document. (...) we don’t need it, and there’s no way to disable it (...)

Note that there's a `_revs_limit` setting available: <http://wiki.apache.org/couchdb/HTTP_database_API>. It's very beneficial for a use case like yours.
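
A sketch of changing that setting over HTTP, using `PUT /{db}/_revs_limit` with a JSON integer as the body. The host, database name and limit below are placeholders, and the request is only built, not sent. A real server may also require authentication.

```python
import json
import urllib.request

def revs_limit_request(base_url, db, limit):
    """Build (but do not send) a PUT /{db}/_revs_limit request."""
    return urllib.request.Request(
        url=f"{base_url}/{db}/_revs_limit",
        data=json.dumps(limit).encode("utf-8"),   # body is a bare JSON integer
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Placeholder host and database name.
req = revs_limit_request("http://localhost:5984", "mydb", 10)
# urllib.request.urlopen(req) would send it to a running CouchDB.
```

A lower limit trims how much revision history is tracked per document. Old document bodies are still only reclaimed by compaction.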

(Though I've seen CouchDB performing rather poorly when taking really heavy read/write load or compacting big datasets, on occasion.)

Hrm... with N "nosql" things, there's great potential for writing. Redis vs Riak. Cassandra to CouchDB. Why I gave up on Mongo and moved to Mnesia. And so on and so forth;-)

Your comment applies to any job for which there are multiple competing tools.
