I am increasingly interested in Riak, in part because a very vocal minority on HN seems to think it is the One True NoSQL solution.
However, I still don't quite get it. What is Riak? It seems to be some sort of Dynamo implementation (like the ironically ill fated Cassandra), but apparently it has a workflow engine? What do people use it for? What is it best at?
Right now we're using PostgreSQL, Redis, and S3. PostgreSQL gives us ACID, Redis gives us fast in-memory access, and S3 gives us an infinite KV store. Is there some reason to use Riak? Would Riak just replace S3?
Ultimately it is an eventually consistent key-value store. They have map-reduce and a new metadata feature that lets you query documents, but you really only want to use it if your data maps well to key/value (and lots of data does!). The really cool things are that it truly scales horizontally (a new node brings read + write + map-reduce power) and that it will help you sleep better with its Dynamo-ism.
I'm not so certain. A value written to Riak has something on the order of 450 bytes of overhead (as of version 0.14; I'm not entirely certain of the exact overhead in 1.0). Basically Riak will write your value to disk with a bunch of other data that it uses internally to do its thing. Writing a stream of integers, one integer per Riak key, would be a bad idea (tm), imho.
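To put rough numbers on that, here is a back-of-the-envelope sketch. The only figure taken from the comment above is the roughly 450 bytes per object; the function names and the example payload sizes are made up for illustration, and the calculation ignores key size and replication.

```python
# Assumption: ~450 bytes of per-object overhead, the figure quoted for Riak 0.14.
OVERHEAD_BYTES = 450


def overhead_ratio(value_bytes: int, overhead_bytes: int = OVERHEAD_BYTES) -> float:
    """Fraction of the stored bytes that is overhead rather than payload."""
    return overhead_bytes / (overhead_bytes + value_bytes)


def stored_bytes(n_values: int, value_bytes: int,
                 overhead_bytes: int = OVERHEAD_BYTES) -> int:
    """Total bytes for n_values objects, before replication."""
    return n_values * (value_bytes + overhead_bytes)


small = overhead_ratio(8)          # one 64-bit integer per key
large = overhead_ratio(100_000)    # a ~100 KB document per key
million_ints = stored_bytes(1_000_000, 8)

print(f"8-byte values:   {small:.1%} overhead")
print(f"100 KB values:   {large:.2%} overhead")
print(f"1M integers:     {million_ints / 1e6:.0f} MB for 8 MB of payload")
```

Under those assumptions, one integer per key is almost entirely overhead, while larger values amortize it to well under one percent.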
NoSQL seems to me to be a scene as much driven by hype as logic. (For better or worse, practitioners using and programming NoSQL systems tend to have little understanding of the 30+ years of relevant RDBMS academic literature, and 20+ years of distributed RDBMS academic literature.)
Given that, NoSQL adoption often appears driven by success stories. But Cassandra seems to have the opposite: numerous failure stories. Facebook, Digg, Reddit and a number of others have all tried Cassandra in production, and have either had serious complaints or moved off to either SQL or other solutions like HBase.
Of course, these failure stories are anecdotes, and numerous unrelated factors (like bad interactions between Cassandra and Amazon's EC2) could be at fault. But I'm not sure it matters.
Has anyone on HN had a really good experience with Cassandra? (This may be the wrong thread to ask for obvious reasons.)
Riak users have had quite the opposite results, based on talks given at conferences and videos put online. Yammer, Voxer, Formspring, and Bump are all Riak customers, and all have a video out there somewhere talking about how much they like Riak. Yes, Riak and Cassandra share a little bit of technology (being Dynamo inspired), but most of Cassandra's woes seem to come from operational difficulties and not the technology theory. Riak has more of a focus on operational friendliness out of the box.
There are videos and talks out there from several customers listed on Basho's site, just take a look.
I just did a rolling upgrade from 1.0b4, and it went smoothly. I love the LevelDB support, and it's holding up well under the considerable load that I'm throwing at it.
I think when you've got your release candidate, you're basically done, barring any unexpected surprises. Given how stable Riak has been during the beta period, I don't think they're really jumping the gun much.
Riak 1.0 will be available later this month. To preview
some of the new features, download Riak, or to inquire
about a commercial deployment, please visit http://www.basho.com.
Their github page still just has 1.0-rc1 tagged. I'm excited though.
A bit tangential to this particular announcement, but I've been musing about using Riak, though so far I've been put off by their (seemingly) open-core, rather than open-source, implementation. Are the paid enterprise functions stuff you eventually need in most use cases? The lack of multi-site replication in particular is curious; would this mean I can replicate between nodes on the cluster, as long as they are in the same datacenter, but not across the interwebs until I hand over some $$$?
Enterprise includes a ring replication layer designed for higher-latency connections.
There is nothing preventing you setting up a cluster that spans continents. What will deter you is the poor performance of the cluster due to the added latency between nodes.
Interesting - from what I remember of the original Amazon Dynamo paper it seems the ring replication is pretty central to the thing (if we are both talking about the ring replication used for the distribution of keys across the nodes).
This is sounding like crippleware :(
Replicas (or as you put it, ring replication) are critical, and Riak very much has replicas. What it doesn't have in its open source version is multi-ring replication (cross DC), which is a separate concern.
In the Dynamo paper the ring spans DCs but they also have a very different network than most that allows them to do that. In Riak it is recommended that each ring is contained in a single DC. If you want ring-to-ring replication from Basho then you can pay for Riak EDS. You could also build it yourself as others have mentioned (Kresten Krab Thorup has done something like this in Riak Mobile [1]).
Nothing is stopping you from running a single ring (cluster) across DCs, and it might even be okay for certain apps, but it's not a choice that should be taken lightly. In general, if you don't understand the tradeoffs you're making in that regard then it's best to stick to one ring, one DC.
Replication of keys around the ring is free. What you pay for is their solution to the problem of significant latency between nodes: code that replicates the whole ring in several sites and coordinates the communication between the sites.
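For anyone unsure what "replication of keys around the ring" means, here is a minimal Dynamo-style sketch. It is not Riak's actual code: the `Ring` class, the round-robin partition assignment, the partition count of 64, and the node names are all assumptions chosen to keep the example short.

```python
import hashlib


class Ring:
    """Toy ring: a fixed number of partitions, each owned by one node."""

    def __init__(self, nodes, partitions=64, n_val=3):
        self.nodes = list(nodes)
        self.partitions = partitions
        self.n_val = n_val  # how many replicas each key gets
        # Assumption: naive round-robin ownership. This only guarantees distinct
        # replica owners when the node count divides the partition count.
        self.owners = [self.nodes[i % len(self.nodes)] for i in range(partitions)]

    def partition_for(self, key: str) -> int:
        # SHA-1 gives a 160-bit integer; scale it into [0, partitions).
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return (int.from_bytes(digest, "big") * self.partitions) >> 160

    def preference_list(self, key: str):
        """Nodes holding the key: the owning partition plus the next n_val - 1."""
        start = self.partition_for(key)
        return [self.owners[(start + i) % self.partitions]
                for i in range(self.n_val)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
replicas = ring.preference_list("trade:42")
print(replicas)
```

Every node in that list sits in the same cluster. The cross-DC feature discussed above is a different thing: copying data between two entirely separate rings.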
The replication in the Enterprise version is replication between entirely different clusters. The ring replication you talk about is definitely open source.
From what I understand, you could build the same functionality on your own if you wanted. Riak has post-commit hooks that you can tap into. I think the multi-dc replication uses them, though I'm not positive.
Riak is Open Source. It contains a very complete platform. Riak Core is a dynamo style distributed system platform (not database specific), Riak Pipe is workflows, Riak KV is a KV database, Riak Search is full text search over that database. And there's a lot of other stuff I'm not even mentioning (like bitcask, the logging stuff, etc.)
When you go to the Riak project on github, what you find is actually sort of a skeleton that has as dependencies all those projects I mentioned above, such as riak_kv, riak_pipe, etc.
Riak ES, the commercial offering, is a superset of Riak. It has Riak as a dependency, and adds the feature of cross datacenter replication. I think the real reason you buy Riak ES is because you're wanting to buy support.
Riak ES being a commercial product doesn't make Riak any less open source, any more than Oracle Server being a commercial product makes Linux less open source.
Also, Basho is keen to develop users of Riak ES, and customers of Riak (who don't spend any money) still get some support from Basho. Basho has a "Riak ES for startups" program, which gives you a huge discount.
I'm building my business on Riak because Riak is open source. If Basho goes away, I'll still have Riak. There's nothing missing from Riak that I need.
I figure if I get big enough where I want to be running out of multiple data centers, I'll be big enough to afford Riak ES, and if I can't afford Riak ES at that point, then I'll be able to build my own solution. (I don't think it would be that hard, actually.)
I guess what he meant was this: "Open core (a.k.a. proprietary relicensing[1]) is a business model where an open source product is also made available commercially with non-open-source additions" [1]
I can't speak to Riak, but generally this model can create a conflict of interest between the "enterprise features" on the one hand and open source commitments on the other: for example, if someone submits code to the open source version that duplicates or overlaps an "enterprise feature".
> I figure if I get big enough where I want to be running out of multiple data centers, I'll be big enough to afford Riak ES
I was thinking along those exact same lines, but a big unknown was pricing on their enterprise offering. That information is unavailable on the web, and despite my skepticism in contact-us-for-the-price situations, I filled out their online form, which is a request to be contacted by a representative.
I haven't heard from them, but they did put me on a mailing list—I got an email about this 'milestone release' today! Not quite what I wanted to know, though :)
Nirvana, or someone using their Enterprise offering, perhaps you could fill us all in on the price?
I haven't used Riak, but I did look into it for a project a short while back. One problem I had was that the documentation on their website is heavily focused on what Riak is, vs how to use it. It's great that you can get such a fundamental understanding of Riak as-a-dynamo-implementation, and they do a great job writing that stuff, but it's completely out of touch with what I expected/needed.
Technically, what eventually put me off is that I couldn't figure out how to maintain a clean secondary index. If you have SiteId, UserId, and Data, and you want data to be accessible by SiteId or by SiteId+UserId, I couldn't figure out a nice atomic way to maintain the secondary index. This is pretty basic stuff. I'm glad to see 1.0 will support native secondary indexes, but I think my inability to figure it out shows that their documentation is poor (or it could be that I suck).
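To make the problem concrete, here is a small sketch of a hand-rolled index on top of a plain key/value store. A Python dict stands in for the store, and the key layout (`data/<site>/<user>`, `index/<site>`) and the `fail_before_index` switch are hypothetical. It shows the failure mode, not a fix.

```python
def put_with_index(kv, site_id, user_id, data, fail_before_index=False):
    """Store a record, then update a per-site index. Two separate writes."""
    kv[f"data/{site_id}/{user_id}"] = data            # write 1: the record
    if fail_before_index:
        # Simulates a crash or timeout between the two writes.
        raise RuntimeError("simulated failure between the two writes")
    index_key = f"index/{site_id}"
    users = set(kv.get(index_key, ()))
    users.add(user_id)
    kv[index_key] = sorted(users)                     # write 2: the index


def users_for_site(kv, site_id):
    """Look up user ids for a site through the hand-rolled index."""
    return kv.get(f"index/{site_id}", [])


kv = {}
put_with_index(kv, "s1", "alice", {"n": 1})
try:
    put_with_index(kv, "s1", "bob", {"n": 2}, fail_before_index=True)
except RuntimeError:
    pass

indexed = users_for_site(kv, "s1")
orphaned = "data/s1/bob" in kv and "bob" not in indexed
print(indexed, orphaned)
```

Bob's record exists but is invisible by SiteId. The read-modify-write on the index key is a second hazard: two concurrent writers could each overwrite the other's update.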
Bitcask can guarantee one disk seek, whereas LevelDB will do one disk seek per level, so at least from that perspective, it can't be better.
Level also has to look down the entire tree if a key is missing. This means inserts end up being more expensive than reads or updates (which are all just a hash lookup in Bitcask).
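The "one hash lookup, one seek" behaviour can be illustrated with a toy append-only log plus an in-memory key directory. This is a sketch of the general idea only: `TinyLogStore` and its methods are invented names, and real Bitcask tracks more per entry and handles multiple files, merging, and crash recovery.

```python
import os
import tempfile


class TinyLogStore:
    """Append-only data file plus an in-memory dict of key -> (offset, length)."""

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.keydir = {}  # every key lives in RAM

    def put(self, key, value: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(value)           # always appended, never updated in place
        self.f.flush()
        self.keydir[key] = (offset, len(value))

    def get(self, key):
        entry = self.keydir.get(key)  # hash lookup in memory
        if entry is None:
            return None               # missing key: no disk access at all
        offset, length = entry
        self.f.seek(offset)           # a single seek to the latest value
        return self.f.read(length)

    def close(self):
        self.f.close()


with tempfile.TemporaryDirectory() as tmp:
    store = TinyLogStore(os.path.join(tmp, "data.log"))
    store.put("k1", b"old")
    store.put("k1", b"new")           # overwrite just appends and repoints
    latest = store.get("k1")
    missing = store.get("nope")
    store.close()

print(latest, missing)
```

The cost is visible in `keydir`: every key has to fit in memory, which is the RAM pressure mentioned elsewhere in this thread.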
"Bitcask can guarantee one disk seek, whereas LevelDB will do one disk seek per level, so at least from that perspective, it can't be better."
Yep, this is a standard tradeoff. When you want your data to be iterable, you have to take the hit. In practice (I oversee a large Cassandra cluster), this hit happens about 1% of the time, which is either a lot or a little, depending on your constraints.
"Level also has to look down the entire tree if a key is missing."
This is why Cassandra has a bloom filter on top of a very similar data store.
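For readers who haven't met one, here is a minimal Bloom filter sketch showing how a store can skip the disk for keys that were never written. It is not Cassandra's implementation; the class name, bit-array size, and hash construction are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent", so the on-disk lookup can be skipped.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


bloom = BloomFilter()
written = ["trade:1", "trade:2", "trade:3"]
for k in written:
    bloom.add(k)

present = [bloom.might_contain(k) for k in written]
empty_says = BloomFilter().might_contain("trade:1")
print(present, empty_says)
```

A `True` answer still requires the real lookup, since it may be a false positive; only a `False` answer lets the store avoid walking the levels.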
LevelDB is there as the replacement for those who are currently using Innostore as their backend, not for those who have a dataset that fits Bitcask.
I imagine it won't be as fast? The cool thing about Bitcask is that all of the keys are in memory - I imagine that would also be beneficial with secondary indexes now supported...
LevelDB seems mostly well suited for data that becomes (in terms of key size and number of keys) bigger than your RAM...
I think it will be a welcome change for anyone who runs a decent sized Riak deployment. We are currently adding machines simply to increase available RAM in the cluster.
Why not bring a node down, and then replace it with a node that has more RAM? Are you exceeding the size of a node you can supply (in terms of RAM) for your cluster?
I'd be very curious to know a bit about the character of your data, the size of your cluster, etc. (I've only run test clusters at this point, so hearing from someone doing production work would be informative.)
Replacement vs. addition is a situational trade-off, but ultimately the problem remains that you need to bring more RAM to the party.
My biggest RAM consumer stores historical data for a goods trading platform. Each trade is a unique key, with all the trade data being the value. Access speed is important, but not as critical as the other goodies I get from Riak (replication and automated rebalancing). Metadata is stored separately, but I hope to change that with Riak 1.0 secondary indexes.
Is anyone here paying for the Enterprise level Riak? I'd love to hear how they charge and whether you think it's worth it. Currently we're looking towards the Denali release later this year but Riak is looking more interesting by the day.