Goodbye, CouchDB (saucelabs.com)
254 points by blanu on May 10, 2012 | 142 comments

"No SQL. It’s 2012, and most queries are run from code rather than by a human sitting at a console. Why are we still querying our databases by constructing strings of code in a language most closely related to freaking COBOL, which after being constructed have to be parsed for every single query? SQL in its natural habitat"

COBOL? Really? I don't see the COBOL connection at all.

SQL is more closely related to relational algebra, so it makes absolute sense when you're querying relational data.

So that's why many of us prefer to query our relational data in SQL. It's the same reason why we write stylesheets in CSS instead of C++... and why we validate postal codes and phone numbers with regular expressions instead of "manually" parsing them with PHP, Ruby, Python, or what have you. It's all about the right tool for the right job.

Relational data isn't the right solution for everything; there are lots of use cases where NoSQL databases absolutely rock and traditional databases are inappropriate. Again, right tool for the right job.

The level of ignorance revealed by that one short quote is just stunning.

I have to wonder whether the author is one of those people who writes SQL that reads like procedural code, full of IFs and WHILEs/cursors. I can't count how many times I've converted application-layer-equivalent code like that into briefer, more readable set-based SQL equivalents that perform thousands of times faster. (I say this as someone who is quite happily using MongoDB in my current personal project - but who will almost certainly be piping that data into PostgreSQL for offline analysis. Right tool for the right job, as GP says.)

The problem with SQL is that everyone thinks they know SQL.

So, on one hand, you have a group of people who think they know SQL well enough and dismiss it as inadequate. These are people who build schemas without foreign keys and run queries without JOINs because they "know better".

On the other hand, you have a group of people who think they know SQL well enough and fall back to it whenever possible. They go on to create views on top of views on top of views, create temp tables to hold intermediate values that get discarded right away, format application output with string concatenation in SELECT, add DISTINCT to SELECT whenever in doubt, etc etc.

We have an interview question (for a senior position) to test if the applicant grasps the concept of "EXISTS/IN (<subquery>)", and thus the concept of set-based SQL. Nobody has got it right yet, though they all can write a ton of SQL.
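The set-based idiom being tested can be sketched with Python's built-in sqlite3 module (the schema here is hypothetical, purely for illustration). The point is that one EXISTS query answers a question about a whole set of rows, instead of looping over rows in application code and issuing a query per iteration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1);
""")

# Set-based: "which customers have at least one order?" in a single query.
rows = conn.execute("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()
print(rows)  # [('Ada',)]
```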

The connection I see is the attempt to make the syntax English-like. Expressions like "SELECT * FROM users" remind me of COBOL's "ADD X TO Y".

Which is a completely superficial connection with no relationship to the actual semantics.

Very true. I'm tempted to argue that SQL's English-like syntax was a mistake, but only because I think it makes people think you don't really need to learn it. It's frustrating to see people trash something when they don't even properly understand it - a lot of the database-related blog posts I've seen make the front page of HN would be shot down if they were similarly misinformed about a language like Javascript.

I'm not by any means saying that the NoSQL movement as a whole is a fad - there's good motivations behind some of these products - but a lot of people who are just ignorant also seem to have hopped onto the bandwagon.

That was an issue I had with the post: how can "non-relational" be a good thing about CouchDB? If you have relational data, then there's an entire theory (relational algebra) behind ways to interact with it that have clearly defined semantics. I can see that one may decide they don't need that, but I don't see how its absence can be a good thing. Being non-relational frees CouchDB to provide different kinds of capabilities, but it's those capabilities that are a good thing, not the lack of being relational itself.

Rather, I think the author has confused "relational" with a lot of other properties of relational database management systems - such as transactions and data integrity guarantees.

You're right that I was lumping transactions and data integrity guarantees in with "relational". I was thinking about normalization, and protecting data integrity in ways that lead to joins and transactions.

You made me think about something. A CoffeeScript-like approach to building a saner language that would sit atop SQL would definitely be worth having. Maybe this could be the start of the "OnSQL" movement. Just my thoughts.

Well, SQLAlchemy is designed just for that. In the words of the author:

> Like "power steering" for SQL. Doesn't teach you how to drive!

Being able to parameterize table and column names in queries would be a big help.

It could have a sharding aware data definition language and some support for "on-the-fly" data migrations.

How is that different from

  for foo in bar
ubiquitous in modern programming languages?

Capital letters.

So SQL is bad because it expresses intent in a language which contains domain concepts?

What would you prefer? Fortran? Assembly?

I didn't say anything about domain-specificity.

My primary point was just that SQL is old, and I think the reason it is the way it is has more to do with history and compatibility than what we'd want it to look like if we started from scratch today.

SQL is still around because it was successful. And there are reasons to value compatibility, and to avoid changing things for no reason. But my personal experience using a database without using SQL was pleasant, and I'm anxious to see the world move on to something new and improved.

There are a lot of things I like about the query interfaces of PetaPoco, ActiveRecord, etc. For simple CRUD operations, they're much prettier and more concise than embedded SQL - Person.find(123) is a lot nicer than "person = db('select * from person where id=123')"

But beyond very simple cases I feel like non-SQL query interfaces very quickly become terrible, clumsy, leaky abstractions.

I usually try to have the best of both worlds - I create a SQL view/function/sproc containing my big gross gnarly joins, and then query it in a pretty way via the ORM.

I still don't see the similarity. Just about any modern language is full of English keywords - for, while, unless, function, etc.

Unlike COBOL (which I admit to not being very familiar with) SQL uses standard mathematical symbols as operators whenever possible... + instead of "ADD" etc.

Try it without shouting --

     select * from users
Looks more like Haskell now.

"No schemas. This was wonderful. What are schemas even for? They just make things hard to change for no reason."

While they do mention the need to enforce constraints on your data, it's comments like these that make me wish all application developers were required to work as a DBA for a few months.

A properly normalized and "constrained" database prevents data loss from stupid mistakes.

Seriously, after that line I gave up on reading the rest. There's plenty to be said about schema vs no schema, but it's pretty ignorant to just dismiss the entire concept out of hand.

Guess what - in lots of applications data integrity is more important than developer convenience.

I made it to the next one before I gave up: "Relational databases grew up solving problems where data integrity was paramount and availability was not a big concern."

Yow... is there a word for the special kind of bubble we're in now, with the profusion of Javascript and schema-free datastore-lovin' folks whose lack of experience in static typing and relational databases does not--in even the slightest way--constrain them from pronouncing their irrelevance?

NoOb ?

That's probably already reserved by some new fad where the notion of objects is considered absurd.

It was colorful language, not dismissal out of hand. It can be hard to communicate emotion online, so it's easy to take offense where none was meant. I think that's what happened here.

Definitely agree on communication of emotion; for instance, I'm not offended in the least. :) All I'm saying is it's hard for me to take someone's opinions on data stores seriously given that quote.

Also, they should be forced to read Fabian Pascal's rants about the end of Western Civilization, aggressive ignorance, and lack of understanding of the Relational Model.


It seems like there is a huge amount of ignorance (on the parts of both developers and DBAs) about the fact that NoSQL doesn't necessarily mean "no schema." For whatever reason, the "cool kids" in the Web dev world completely ignore the ugly stepchildren of the NoSQL world, graph databases. Graph databases provide the protections of a schema, but the schema can be altered without massive pain on the parts of developers. Even better, the RDF style graph databases have a W3C standard query language called SPARQL.

Don't get me wrong, I love my RDBMS (MySQL and PostgreSQL) and I also use Mongo and love its extreme simplicity, but I wish more developers understood that schemas exist in places other than RDBMS.

Yeah, I don't get the anti-schema sentiment either. Coincidentally, I just wrote up my thoughts on it yesterday:


"no schemas" means "schema in the application layer". sometimes its nice for the additional flexibility, but its never as reliable.

And not only that, your application becomes littered with:

if (data.schema_version === 1) { ... } else if (data.schema_version === 2) { ... }
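One common mitigation for this version-check litter is to upgrade documents lazily at read time, so the branching lives in one place instead of all over the application. A minimal sketch (the field names and the v1-to-v2 split are hypothetical):

```python
# Normalize every document to the latest version as it is read,
# so the rest of the app only ever sees the current shape.
LATEST_VERSION = 2

def upgrade(doc):
    version = doc.get("schema_version", 1)
    if version == 1:
        # v1 stored a single "name" field; v2 splits it in two.
        first, _, last = doc.pop("name", "").partition(" ")
        doc["first_name"], doc["last_name"] = first, last
        doc["schema_version"] = 2
    return doc

doc = upgrade({"name": "Ada Lovelace"})
print(doc)  # {'first_name': 'Ada', 'last_name': 'Lovelace', 'schema_version': 2}
```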


But it is better than not being able to change the schema once there are more than a few million rows, given the enormous time an ALTER TABLE can take.

That's just a limitation of MySQL, not relational databases in general. In Postgres you can do it live in a transaction.

Schemaless would let you add/remove a field easily. What is hard is if you need to restructure things, which is also probably a more common occurrence in a document db than a relational one. If you hit a few million rows, you are in for a world of hurt changing things in any data store

'A properly normalized and "constrained" database prevents data loss from stupid mistakes.'

A properly written application layer also prevents data loss from stupid mistakes. A stupid mistake made while setting up a properly normalized database also causes data loss.

You have to be very smart to be able to design a normalized constrained DB well. The fact that only smart people can do it doesn't mean that people who don't do it aren't smart.

"A properly written application layer also prevents data loss"

Agreed, but the application layer _generally_ doesn't have the abstractions to make it trivially easy to put these safeguards in place like it is at the database layer.

Also, "stupid mistakes" does not in any way imply that the people who make them are stupid. Nor am I implying that you have to be particularly intelligent to normalize and constrain a database properly. I'm simply lamenting how undervalued a "good schema" can be.

[side note: upvoted your comment :D]

> You have to be very smart to be able to design a normalized constrained DB well.

Eh, nit picking, but I think "very smart" is overkill--I think "just smart" people should still be capable of designing normalized constrained schemas.

If they are incapable of doing this, then I don't want them writing any code anyway.

Being capable of doing it and choosing not to is either at least forgivable or completely understandable, depending on the situation.

"You have to be very smart to be able to design a normalized constrained DB well."

But you can be a complete moron and write "[a] properly written application layer [that] prevents data loss from stupid mistakes"?

What's the difference? Writing correct code can be hard. I don't think it's particularly easier to apply all your constraints in app code unless you just don't know about the database backend you're using.

It's easier to constrain your objects in the same language they're written in. Say I have an object where my constraint is that either fielda is set, or fieldb and fieldc are set, but not both (ignoring for the moment that that's a stupid object to have). I can trivially enforce that in a constructor, but it would take me quite a while to work out how to express that in SQL, if it's even possible.

I think this reflects more on you than on SQL.

This is fairly trivial to express as a table-level constraint. I've done very similar things in Postgres. I have no idea if you can do this in MySQL, but it's quite crippled.

You absolutely do not have to be "very smart" to design a correct relational database. There are a very small number of very simple, easy to understand rules. It requires making a small effort to educate yourself on the basics of the relational model, and that's it. No special genius required.
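For what it's worth, the either/or constraint described upthread really is expressible as a table-level CHECK. A sketch using sqlite3 (the same CHECK syntax works in Postgres; the table and field names are taken from the hypothetical example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Either fielda is set, or fieldb AND fieldc are set, but not both.
conn.execute("""
    CREATE TABLE t (
        fielda TEXT, fieldb TEXT, fieldc TEXT,
        CHECK ((fielda IS NOT NULL AND fieldb IS NULL AND fieldc IS NULL)
            OR (fielda IS NULL AND fieldb IS NOT NULL AND fieldc IS NOT NULL))
    )
""")
conn.execute("INSERT INTO t VALUES ('a', NULL, NULL)")   # satisfies the CHECK
try:
    conn.execute("INSERT INTO t VALUES ('a', 'b', 'c')")  # violates the CHECK
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```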

What I really don't understand is why they chose MySQL over PostgreSQL - Postgres already has hstore as a column type to store schemaless data. This includes support for indexes on the field.

Probably because, as they stated, they are familiar with MySQL. After getting hurt trying out a technology that was new to them, it makes sense that they'd go back to something they know. That will probably lead to more uptime, which is what they want.

We were excited to try a NoSQL db, having spent too many years using MySQL in ways that the designers of relational databases never imagined.


Given that we had experience with MySQL and knew it was adequate for our needs, it was hard to justify any other choice.

Agreed. It seems pretty clear from reading the article why they went with MySQL, which you wouldn't know from all of the Postgres butthurt in the comments.

Except they didn't choose mysql -- they chose Percona. Bonus of going with Percona is that in the process you get a fantastic company to back you up.

The author mentioned that one of the reasons for choosing MySQL is that they are familiar with it. They could've gone with PostgreSQL and EnterpriseDB, also a great company. It's not like you can't get commercial support with PostgreSQL.

A "fantastic" company and a crappy database. Sounds like a questionable set of priorities.

I don't like MySQL either, but they're basically using it as a networked hash table. It's not so bad that it can't do that.

This is InnoDB's sweet spot really -- a mostly read-only in memory data set where the lookups are done primarily by PK. MySQL 5.5 can scale this kind of workload to 32 cores.

I'm pretty sure given this kind of workload MySQL will outperform PostgreSQL handily.

And PostgreSQL 9.2 will be able to scale this workload linearly to 64 cores. So while MySQL may or may not win it will certainly not "outperform PostgreSQL handily".


The key is that PostgreSQL 9.2 will be able to handle a 64 core workload, but current released versions of PG do not.

The fact is current versions of PG are unable to use more than 60% CPU on a 24 core machine. Do you know anyone who uses a dev version of an RDBMS in production?


I believe that MySQL 5.5 scales to 32 cores but not linearly, while PostgreSQL 9.1 caps at 24 cores. As I said in my last comment: I do not doubt much that MySQL would beat PostgreSQL 9.1, but it won't beat it "handily".

Microsoft's Extensible Storage Engine (ESE) will also do an extremely fine job at this, and much more.

Why do you say this like it is impressive? Postgresql will scale to 32 cores with a real workload, and has done so for a few years. Mysql performance still tanks at 8 cores. It is very unlikely that mysql will be able to match postgresql for this workload, much less outperform it "handily".

Postgresql will scale to 32 cores with a real workload, and has done so for a few years. Mysql performance still tanks at 8 cores.

Both of those statements are not accurate, but hey, what's it matter? Without benchmarks we're both talking out our ass anyway.

The postgres part of the statement is accurate; http://rhaas.blogspot.de/2012/04/did-i-say-32-cores-how-abou...

No, it's not. The referenced article is talking about Postgres 9.2devel. Version 9.2 isn't out yet, and even if it was, it still wouldn't be true due to the clause "and has done so for a few years".

The lock manager bottlenecks that stopped PG from using more than 60% of the cpu power on a 24 core box were discovered a little less than a year ago.


You are making assumptions about one scenario based on limitations encountered in a very different scenario. The problems that occur around 24 cores occur on benchmarks consisting entirely of select statements against a single table. As I said, postgresql has scaled to 32 cores for real workloads for a few years. Real workloads have more than one table.

See here for an example of mysql having problems at only 8 cores (and postgresql destroying mysql's performance): http://www.scribd.com/doc/551889/Introducing-Freebsd-70

Postgresql scaling to 28 cores in 2007: https://docs.google.com/viewer?a=v&q=cache:-ytn3fY_Lr8J:...

Postgresql on 32 core t2000 being able to scale up to 1024 concurrent clients in 2008: http://www.pgcon.org/2008/schedule/attachments/50_46_pgcon20...

Your MySQL example isn't exactly relevant. It was in 2007, yet you said "MySQL still tanks at 8 cores". Furthermore it was on FreeBSD, with a flawed libpthread.

In terms of "real" workloads, I'm not going to bother getting into it, as this is quickly devolving into a No True Scotsman argument.

Instead of arguing pointlessly about it, maybe our energy would be better spent publishing some benchmarks.

"constructing strings of code...which after being constructed have to be parsed for every single query?"

I don't know about MySQL, but my database caches compiled queries.

"Things like SQL injection attacks simply should not exist."

They don't exist, if you don't construct SQL queries by concatenating strings and variables.
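Concretely: parameterized queries keep data out of the SQL text entirely, so the classic injection payloads are inert. A sketch with sqlite3 (every mainstream driver has an equivalent placeholder syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

evil = "alice' OR '1'='1"
# The driver passes the value separately from the query text,
# so the quote characters are never interpreted as SQL.
rows = conn.execute("SELECT * FROM users WHERE name = ?", (evil,)).fetchall()
print(rows)  # [] -- no match, and no injection
```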

Meanwhile, all the cool kids are talking about getting rid of procedural code in favor of declarative DSLs...

The article sounds like they will be trading one set of problems for another, if they implement SQL on any RDBMS using those techniques.

"They don't exist, if you don't construct SQL queries by concatenating strings and variables."

My point is, people still do this. You never hear about REST-injection or memcached-injection attacks, even though those are possible in principle, because those protocols don't encourage this mistake the way using SQL as a database API does.

> What are schemas even for? They just make things hard to change for no reason.

This attitude right here is why the RDBMS old guard despises NoSQL. Willful ignorance should not be celebrated.

I was floored by this quote. As technical founder, one of my chief priorities operationally is ensuring that a) our systems are up and b) our data is consistent. I've dabbled in NoSQL (a la friendfeed) a bit and my head exploded when I wanted to migrate meta-schemas on multiple machines (potentially writing to the same data) with minimal concurrent alternate code paths. I need to be able to speak with 100% confidence that every object in our database is consistent and valid and without a schema on a database level this is really hard to enforce because anyone (including a bug) could accidentally throw a borked bag of properties into a key value store.

A lot of the excitement over schema-less data stores (I think) really comes down to instant DDL changes, which is why you should really just use Postgres instead of MySQL.

"If you don't like it, it must be because you haven't taken the time to understand it" is cognitive poison. What evidence will convince you that someone has understood well enough to judge that something doesn't make sense?

I've spent many years using schemas, and I know well how they work and what they achieve. I'm saying they're a lousy tradeoff.

I'm attacking what you actually said in the article. You're attacking your own statement of what you think I believe.

Engineering choices are always tradeoffs. If they're founded on ignorance, they're bad choices--always. Your article is trying too hard to be cute and funny and comes across as ignorant. I assert that promoting uninformed decision-making is both intellectually dishonest and constitutes a much truer form of "cognitive poisoning."

You may well understand what they are and have a well thought out nuanced opinion, but the quote shows none of that. It sounds like an out of hand dismissal of the whole concept of schema which would be pretty ignorant.

*edit I misspelled ignorant... irony alert

The quoted statement was hyperbole. I followed it with as much nuance as I felt it made sense to get into in the broader context of the article.

There's quite a big difference between "no reason" and "lousy tradeoff."

Why did you use MySQL rather than Postgres? It seems like most of your complaints about MySQL are solved in Postgres (the query planner is much stronger), and there are some features that seem like they would fit your team much better

A number of us know MySQL fairly well, and in particular I've seen how it's used by some of the biggest internet companies. We have some postgres experience on our team as well, but it's a little more of an unknown. So experience trumped feature set in this case.

One thing I would say about postgres is that it has a lot of features. As a new user, it's hard to know which ones to use in which ways, and what the downsides might be.

"One thing I would say about postgres is that it has a lot of features. As a new user, it's hard to know which ones to use in which ways, and what the downsides might be."

Do you have some suggestions how that can be improved? Are there features that you would classify as "bloat" or that seem confusing or poorly documented?

Postgres is a very general system and has a large variety of users as a result. That means that the features tend to be very well-thought out and don't carry a lot of surprises, but it also means that it's hard to guide users toward specific usage patterns. Even among web users, a feature like LISTEN/NOTIFY might be an instrumental part of the caching infrastructure for some users, but seem like bloat to others.


Limitations of our universe, such as time, brain capacity and sanity?

To summarize some of the other (upvoted) comments on this cringe-worthy article:

Output to any external system must be encoded to prevent fill-in-the-blank injection, if it uses a language vs a string API-only approach. Use prepared statements.

SQL is not COBOL. Sets != ISAM.

You can store arbitrary data (XML, JSON) in BLOBS/CLOBS in an RDBMS. Denormalization is frowned upon, but not forbidden.

PostgreSQL is arguably a better free / open DBMS than MySQL.

DBMS data constraints are a good thing; use them when they make sense.

that should have been "strong API" (not string)

I've been using CouchDB for the past six months for the internal product-CMS for my business. It's got a great feature set for what I am using it for. MVCC architecture, master-master replication and stored views make it a natural fit as a backend for internal tools. The benefits actually grow as you get more designers/artists etc working together, each on their own DB.

It seems like all the problems he has are with scaling. I can't comment on that, but I would whole-heartedly recommend it for internal tooling.

As a systems guy, I appreciate stories about developers learning that systems are complicated, and the latest and greatest technology is often not as stable or optimized as hoped.

Although, I do have high hopes for Key:Value store data repositories.


However as a developer guy I am glad they picked a technology that let them make quick prototyping progress and started shipping a product.

Remember only successful companies have to worry about scaling. Un(/not yet)-successful companies have to worry about shipping first, then scaling. So perhaps Saucelabs reached that milestone, and they simply have outgrown the original technology that helped them ship.

Who knows, maybe if they had spent time re-implementing a REST interface on top of MySQL or re-implementing Futon for doing debugging, we might not even have heard about Saucelabs these days.

+1 for op and gp. At a previous gig, we had the opportunity of using SQLite or a raw JSON store for a mobile application we were building. Although SQL would've been the "right" way to do it, JSON implementation was basically a no-op given our JS-based framework.

It did make a few things harder later on but we were better equipped to make a change around the time we actually needed to. If we had spent a lot of time up front developing for SQLite, we'd have paid an overhead tax for years for no good reason. And possibly wouldn't have made it to the point in which we needed to make a change.

That said - our use case was incredibly narrow (read only data store, server outputs JSON anyway, etc.) and so we made a reasoned choice.

If we needed complex queries/joins, updates/deletes, incremental loading, etc. - then a JSON store would've been terrible.

Key-value stores like dbm since the 1980s? You're not exactly going out on a limb with those high hopes. :)

This is similar to what FriendFeed did in 2009 - storing schema less data into MySQL: http://backchannel.org/blog/friendfeed-schemaless-mysql

So, the TL;DR version of this is: we stopped using CouchDB because it sucked, but we took all the things that made it great and shoe-horned them into MySQL, while forgoing everything an RDBMS is designed for, like joins...

How do you go about searching the DB if everything is stored as a JSON object? Are you using an index like solr/sphinx instead of doing any searches directly in mysql?

We keep things we might need to search in regular columns (typically with indexes). The JSON object is just a way to add extra data to rows, which we can fetch and deal with on the app side. That works fine for a lot of things, and we're hoping it will give us some flexibility around when we need to do schema migrations in some cases.
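The hybrid layout described above might look like this (a hypothetical sketch using sqlite3; the table and column names are illustrative, not Sauce Labs' actual schema). Searchable fields get real, indexed columns, and everything else rides along in an opaque JSON blob the application decodes:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        status TEXT,           -- searchable, indexed column
        extra TEXT             -- JSON blob for schema-flexible attributes
    )
""")
conn.execute("CREATE INDEX idx_jobs_status ON jobs (status)")
conn.execute("INSERT INTO jobs VALUES (?, ?, ?)",
             (1, "complete", json.dumps({"browser": "firefox", "os": "linux"})))

# Filter on the indexed column; decode the blob on the app side.
row = conn.execute("SELECT extra FROM jobs WHERE status = 'complete'").fetchone()
print(json.loads(row[0])["browser"])  # firefox
```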

Your DB should be cognizant of JSON. For instance, JSON records can be converted into key-value combinations that are then indexed.
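One way to realize that key-value decomposition is to flatten each JSON document into (doc_id, key, value) rows in an indexed side table. A minimal sketch (top-level keys only, names hypothetical):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE doc_index (doc_id INTEGER, key TEXT, value TEXT);
    CREATE INDEX idx_kv ON doc_index (key, value);
""")

def insert_doc(doc_id, doc):
    # Store the raw JSON, then index each top-level key/value pair.
    conn.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, json.dumps(doc)))
    conn.executemany("INSERT INTO doc_index VALUES (?, ?, ?)",
                     [(doc_id, k, str(v)) for k, v in doc.items()])

insert_doc(1, {"browser": "chrome", "os": "linux"})
insert_doc(2, {"browser": "firefox"})

ids = [r[0] for r in conn.execute(
    "SELECT doc_id FROM doc_index WHERE key = 'browser' AND value = 'firefox'")]
print(ids)  # [2]
```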

"guesses wrong about how to perform queries all the time. Experienced MySQL users expect to write a lot of FORCE INDEX clauses."

This is most generally a sign of bad indexing/query construction. I've seen so many databases with dozens of indexes placed on tables (which only had a requirement for a few) because the developers just didn't grasp how they should be set up - which isn't rocket science.

This article explains why I am using MySQL as the DB backend for a site I am currently building. It also explains why I am not using Node.js. New technologies are fun to play with, but they get decidedly less fun when your site starts getting traffic and you realize that your new toy isn't ready for prime time.

It's important to be conservative in the right places. Data storage is one of those places: data is the lifeblood of most businesses, and once it goes bad there's often no way of making it good again. Making sure it's correct and remains correct is critical.

Programs, by comparison, are a lot more flexible here, unless you're in the financial/health/auto/aero industries.

As an aside, where I work we use a lot of node. It works.

It does nothing of the sort; node.js is ready for prime time - just have a look at Transloadit, Voxer, Yammer etc etc. Don't confuse the tool not being ready with you not being ready to use the tool. Equally, there are plenty of people making CouchDB work for them.

None of those companies have a website as their main property as I do. The one I think of when it comes to using node is Klout, and Klout's performance is absolutely atrocious.

Voxer does 170m https hits/day and 2 billion http hits/day on node.

>We’re convinced that NoSQL is the future.

The future of what? Non-relational data? Relational databases are very good for a wide variety of problems. And they will continue to be very good for a wide variety of problems.

I wish this article had some hard numbers for availability, performance, and the size of their data as opposed to hand-waving.

Shameless plug: If you're looking to benchmark or load test CouchDB a bit, I wrote one at https://github.com/mgp/iron-cushion. Hopefully someone out there will use this to decide if CouchDB's performance meets their needs, because migrating away from any database is painful...

More a lesson about the pitfalls of building systems using technologies that you don't understand very well than it is anything specific to either CouchDB or MySQL.

More a lesson that requirements change over time.

If you want to run a NoSQL layer over a relational DB, check out something like Goatfish:


Not production code by any stretch, but an interesting concept, and I'd be more inclined to work on it if more people were using it.

Sounds like he has Lotus Domino-like problems in a product that closely resembles Domino.

Yet in the same page makes fun of SQL for being old and busted? I don't get it.

Notes' main problems are/were the lack of joins and most aggregates, lack of proper transactions, lack of indexes that would permit useful runtime queries, somewhat slow data access layer, and glacially slowly-evolving, ugly-ass UI.

It is also so easy to build on, that people who don't know what they're doing and shouldn't be building any software for redistribution, will do so anyway, and gain just enough success to become extremely annoying. Its IDE is about on par with most other IBM-involved software development exercises, i.e., fairly bad.

Now, all that aside, it is a very powerful, but not generally well-appreciated, system, and is great (fast, cheap, hard to kill) for a large swath of applications that don't require any of the stuff from the first paragraph.

What I really don't get is not using Mongo. It would be a natural fit, and besides, isn't that "Don't use Mongo" link you posted a known hoax?

Meh. Single threaded map reduce and very hungry disk usage. Mongo isn't the silver bullet a lot of supporters sometimes project it as (not saying you are).

Also, he actually linked 2 articles. I'd never seen the first one, so it may very well be a hoax or whatever. But the 2nd one has made its rounds a few times and has stood up (IMO) to scrutiny.

tl;dr = "Turns out people who've been doing this longer than us actually know what they're talking about"

They're doing what works for them and good for them for that. But...I think a LOT of people are really missing out by passing over Riak.

Many of the issues they found with CouchDB have been resolved with Riak. I think the sync API for CouchDB is really cool, but Riak has the auto-sharding thing down cold.

Riak runs map reduce queries across multiple nodes, so performance and capability can grow as you add nodes.

CouchDB's views are neat but they impose some constraints that Riak's more dynamic approach resolves (at the cost of possibly running more queries, but these results can be cached easily giving Riak a form of "views" for often run queries.)

I believe Riak's choices for backend are superior to CouchDB's. Further, Riak supports multiple backends so you can choose the one appropriate for your service (including InnoDB, LevelDB and Basho's Bitcask, as well as a super secret hidden gem of a Caching RAM backend.)

Riak now has indexing of data, and queries on these indexes, but I can't compare it to CouchDB. I can say that the feature is close enough for me to not miss SQL.

I think Riak's "view performance" compared to CouchDB should be good, but may not compare to MySQL. But then, we're talking single-node performance. Riak is distributed - if you need more performance, you just add nodes and point them at the cluster. MySQL requires you to architect a (from my perspective) brittle configuration of servers that can run into SPOF issues.

For instance they talk about having a single write master. What happens when a meteor crashes to earth and takes out that machine? Really unlikely, sure, but I have had enough machines have failures (and failures are often really weird) that I don't trust ANY machine to be a single point of failure. ... and when I'm forced to, like being in a single datacenter or having a single network switch, I don't like it, so I avoid it when I can.

Riak has automatic sharding and automatic rebalancing. It loses a node and keeps running. You add nodes and it redistributes around. Riak is an operational dream.
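Riak's actual ring uses a fixed partition count on a 160-bit SHA-1 space, but the property that makes "lose a node and keep running / add a node and redistribute" cheap is plain consistent hashing. A toy sketch of that idea (node names and vnode count are invented for illustration):

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Toy consistent-hash ring in the spirit of Riak's Dynamo-style
    partitioning; not Riak's real implementation."""

    def __init__(self, nodes, vnodes=64):
        self.points = []  # sorted (position, node) pairs
        for node in nodes:
            self.add(node, vnodes)

    def _hash(self, s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def add(self, node, vnodes=64):
        # Each node claims several virtual points so load spreads evenly.
        for i in range(vnodes):
            self.points.append((self._hash(f"{node}:{i}"), node))
        self.points.sort()

    def owner(self, key):
        # A key belongs to the first node clockwise from its hash.
        h = self._hash(key)
        i = bisect_right(self.points, (h, "\uffff"))
        return self.points[i % len(self.points)][1]

ring = Ring(["riak1", "riak2", "riak3"])
before = {k: ring.owner(k) for k in ("user:1", "user:2", "user:3")}
ring.add("riak4")  # rebalancing: only keys adjacent to riak4's points move
after = {k: ring.owner(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
```

Adding a node only reassigns the keys that now fall next to its ring positions, which is why rebalancing does not require reshuffling the whole keyspace.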

Not to bash CouchDB at all (or MySQL). I think CouchDB is a great product for certain use cases.

I just think a LOT of people are really missing out by passing over Riak.

CouchDB has some features that other databases don't have: a continuous changes feed, a REST interface, master-to-master replication, and a web interface to the data and management (Futon). We need those features (yes, including Futon. It is a feature because it lets us quickly prototype and debug. It makes a black box that you drop your data into transparent).
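The continuous changes feed is itself just part of the REST interface: `GET /{db}/_changes` streams one JSON line per changed document. A minimal sketch of building such a request URL (the base URL and database name are illustrative):

```python
from urllib.parse import urlencode

def changes_url(base, db, since=0, continuous=True):
    """Build a CouchDB _changes request URL; with feed=continuous the
    server holds the connection open and streams changes as they happen."""
    params = {"since": since}
    if continuous:
        params["feed"] = "continuous"
    return f"{base}/{db}/_changes?{urlencode(params)}"

url = changes_url("http://localhost:5984", "config")
```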

But we're not using it for large datasets. We are using it mostly for configuration and setup. There is a custom clustering setup built on its m-m replication and changes feeds.

But I agree that for IO scaling and large data sets Riak would be a top choice. But there are other contenders to look at as well: Cassandra, BigCouch and the upcoming Couchbase Server 2.0

Every database is different, but that doesn't mean there are a lot of really unique features. The unique feature of CouchDB is the way it does views, but that doesn't mean you can't do views (in fact, in my opinion, better) in other databases.

Continuous changes feed: you can get this with Riak, and more importantly you can get a feed of just the relevant changes. Plus you don't need this in Riak the way you do in CouchDB, because Riak already has distribution built in.

Master-master replication as done in CouchDB is inferior to the truly distributed database that Riak is. (I.e., it's not replication; the database itself is distributed.)

Web interface to data and management: Futon was a leader here, but there are several tools for Riak that cover these bases, in my opinion.

I think CouchDB does an excellent job as a configuration database.

I looked at BigCouch, which is taking the Dynamo ring concept and applying it to CouchDB. That is a good solution for CouchDB (in fact, they should build it into the core), but it's also what Riak is built from the ground up to be (a Dynamo ring). Cassandra is a different animal, and I can never figure out what Couchbase is going to become.

The current plan is to integrate BigCouch into CouchDB.

I haven't seen any articles by users, but Couchbase looks pretty interesting: http://www.couchbase.com/couchdb

See http://blog.couchbase.com/how-couchbase-helped-omgpop-break-... .. not directly by a user but still a real world usage story.

I investigated using Riak for dealing with our metrics a few months ago, but with the data sizes we are dealing with, even the Riak people told us that Hadoop was likely a better solution.

Once you are dealing with more than 500k keys or so, Riak starts to fall over.

EDIT: The 500k key limit pertains to mapreduce jobs, not the overall data size.

I feel somewhat responsible for this confusion, as the guy being quoted here... :-(

Riak will handle billions of keys just fine. We had, I dunno, a half a billion in a six node bitcask-backed cluster and were only at half capacity. Much much bigger installs exist. The limit I was referring to is for a single mapreduce job; Riak MR just isn't well-suited to operations over millions of keys at a time. It can do it, but Riak MR isn't really designed for bulk processing: and I wouldn't be surprised to see MR become unusably slow over millions of keys. You'll get better performance out of Hadoop, generally, for bulk analytics.

The other tough point is key-listing. Listing buckets, listing keys, key filters, MR over buckets, all those features are essentially useless in production. Where the number of keys is large and unguessable it can become a logistical nightmare to keep track of them. 2I key indexes can help, though.

I have a use case, which I don't know if it's common or not.

I want to put millions of items in riak, play with it, and then throw them away.

I might want to do that because I'm testing out something, or because it's the result of some periodic batch processing in production, which I want to get by key later.

Unfortunately, riak doesn't seem to have the notion of a "db", "keyspace" or whatever you want to call it; i.e. something which you could "drop" and that would simply delete a directory with a dozen files in it (should be quite cheap).

The only thing I can do is to drop the whole riak db, which has the following problems:

1) I have to do it manually on all nodes (stopping the cluster, deleting the files etc)

2) I cannot share a riak cluster between several users/teams, so that each user/team can play with a portion of it while there is only one central installation of the whole cluster. Every application (for which I want to be able to drop the whole db and recreate it) has to run its own riak cluster.

Initially I thought that "buckets" were intended to solve this problem, but buckets don't map to a separate storage location, it's just a way to group items. Even listing all buckets present in the db requires scanning all keys and, as the doc says, "Similar to the list keys operation, this requires traversing all keys stored in the cluster and should not be used in production."

Although I've been told that "riak is not designed to do this and that", I'm not sure if these limitations are really technical, or just because the product development effort was targeted at some of the aspects, and these issues could be addressed in a later stage.

Any idea?

Tough call. If you did want to use Riak for fast bucket-drop, your best bet might be to:

a.) Run multiple clusters--not too difficult. Just give each one a different erlang cookie and run em on subsequent ports.

b.) Take bitcask_backend or leveldb_backend and add drop-bucket functionality. Custom backends are more difficult than running multiple clusters, but certainly not impossible. You could build it on top of fold or split writes up into, say, one leveldb per bucket. Don't recall if the vnode interface has drop-bucket so you might have to write some plumbing alongside Riak. jrecursive has done this in Mecha.

If I were building something like this, I might look first at Cassandra or Hbase, or possibly sharded master-slave postgres.

Thank you for your answer. The problems I see with (a), off the top of my head, are:

1. Even if it's easy, somebody has to do that.

2. Setting up all the monitoring etc. for each instance.

3. Running more than one riak daemon on the same machine means that each riak daemon is unaware of the IO operations performed by the others, hence IO throughput could suffer. This means that in practice you would need to mount separate disk heads (and we are back to 1).

4. Each riak instance will require some RAM as well, so memory has to be allocated and there is the risk that it's over-allocated.

5. Port allocation. I fear it would end up with something like: "just keep an internal wiki page where each 'db space' is mapped to a port number."

Well, the problem with (b) is of course that I don't have time to do that. For now we stick to cassandra, but Riak is so nice in many aspects that I really hope that at some point, as the product matures, more resources can be invested in aspects which are not currently perceived as "selling points" for riak, but are important for some scenarios and not technically impossible.

Yeah, if you're using Cassandra and the GC/rebalancing issues aren't affecting you, you're probably fine sticking with it. Both are Dynamo-structured, so your consistency/failover model advantages are similar.

That doesn't seem like a very large number. Are you sure?

Yes. I should clarify that I meant 500k keys used in a single m/r job. We needed to be able to run m/r over roughly 200 million keys at the time.

And it turns out you are misrepresenting the situation completely. You can run M/R over key sets in the billions of keys. It sounds like you've not organized your data at all.

You're bashing a product here based on your lack of knowledge, not the product's lack of capabilities.

Based on what I've seen for some internal things, that's one claim I'd like to see support for. m/r on Riak has been an unmitigated disaster here for anything beyond incredibly trivial working sets.

I've never seen a Riak MR job over more than 3 million keys complete, on a 6-node SSD cluster. It might be possible, but you'd have to throw a lot more HW at it than the comparable Hadoop setup.

I see you're irrationally proselytizing again.

We just saw you over in the programming languages thread, now you're here, refusing to confront the reality of how broken M/R is in Riak.

What monkey crawls around on your back to make you so confrontational and irrational?

I think your statement is both out of date and quite broad. Plus you imply that a database would be limited to 500k keys, which is silly, when you really mean a map-reduce job. And further, are you really doing MR over your entire dataset, all the time, or would key filtering, ranges or secondary indexes be a better fit? It's easy to do M/R in Riak over only the correct amount of data.

Hadoop may have been better for what you are doing, and logging metrics is a particular use case where a specialized database is most appropriate.

But it is incorrect to imply that Riak falls over at any specific key limit. This is simply untrue. With Riak you can always add more nodes if you need more capacity, and map reduce is done in a distributed fashion, so adding more nodes adds map reduce capacity. It's not perfect, but it is not brittle.

You're welcome to peruse the mailing list thread.


That thread shows that all of the particulars of your claims about Riak are actually false. Further it seems you didn't bother to understand how Riak can solve your problem and thus decided that it cannot.

Verbatim, from the mailing list:

"If large-scale mapreduce (more than a few hundred thousand keys) is important, or listing keys is critical, you might consider HBase."

"Riak can also collapse in horrible ways when asked to list huge numbers of keys. Some people say it just gets slow on their large installations. We've actually seen it hang the cluster altogether. Try it and find out!"

I chose the most polite way to point out his error, and now you are compounding it by attempting to rebut me with quotes that don't actually rebut me if you know what you're doing. Listing all keys is a function meant for debugging, not for running in production; if you're running M/R jobs based on that, then you don't know what you're doing. The person you're quoting, in fact, said they were doing MR jobs over billions of keys. Further, the person who made that recommendation doesn't work for Basho, and saying he should consider HBase is not the same as saying that Riak can't do it.

You want to say I'm wrong, make a specific argument. Don't selectively quote things out of context that actually don't rebut my position, as that's profoundly dishonest. It is a way of pretending to rebut someone but without saying anything yourself so you can't be pinned on any statements. It is disingenuous.

I'm really tired of having to rebut these argument-from-ignorance "rebuttals" here on HN.

"The person you're quoting, in fact, said they were doing MR jobs over billions of keys"

Ctrl-F billions and found one match in the post I was quoting. No other reference to very large MR jobs in the post quoted.

"At Showyou, we're also building a custom backend called Mecha which integrates Riak and SOLR, specifically for this kind of analytics over billions of keys. We haven't packaged it for open-source release yet"

So the OP is supposed to use an unreleased experimental custom backend to do his big mapreduce jobs?

I am the person being quoted. You are correct that keylisting is not suitable for production use. We definitely don't do MR jobs over billions of keys: our huge data queries are powered by Mecha, which uses Solr.

I looked at Riak a bit (we use CouchDB and Redis at our startup). It seemed not to scale down as nicely as CouchDB: no Futon, binary APIs, and a lot of emphasis in the documentation on sharding stuff (whereas we just run CouchDB on a single server with a backup server that we continuously replicate to).

Does riak support range queries?

It does on secondary indexes http://wiki.basho.com/Secondary-Indexes.html

I should mention that 2I performance may be a little slow, depending on what kind of indices and queries you need. It's not hard to try it out and benchmark, though.

Yes, it does range queries based on key, if the key follows a defined format.

It also has secondary indexes.
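The "defined key format" trick mentioned above amounts to padding the sortable part of the key so that lexicographic order matches the order you want to range over. A sketch under that assumption (the key scheme and in-memory "backend" are invented for illustration):

```python
from bisect import bisect_left

def metric_key(name, ts):
    # Fixed-width timestamp so lexicographic order == chronological order.
    return f"{name}:{ts:010d}"

# Stand-in for the sorted key space of an ordered backend such as LevelDB.
keys = sorted(metric_key("cpu", t) for t in (100, 200, 300, 400))

def range_query(keys, name, lo, hi):
    # Half-open scan [lo, hi): what an ordered backend gives you cheaply
    # once the key format encodes the range you care about.
    return keys[bisect_left(keys, metric_key(name, lo)):
                bisect_left(keys, metric_key(name, hi))]
```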

One of the nice features of CouchDB is that the views are incrementally updated, which is ideal if you have a large dataset that changes frequently in small increments and you frequently want the most up-to-date transformed data.

Looking into Riak, I am under the impression there is some caching done on parts of their map reduce system, but I couldn't really find a lot of advice on the performance characteristics of frequently running a map reduce over a large, slowly-changing dataset.

Does anyone else have experience using Riak for this kind of thing?

In short, don't. In Riak you would aim to update those views at write time, to minimize the number of reads required.
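A toy sketch of that write-time approach: maintain the aggregate as part of each put, so reading the "view" is a dictionary lookup instead of a map/reduce pass (the count-by-type view is an invented example, not a Riak API):

```python
class WriteTimeView:
    """Sketch of maintaining a 'view' at write time: each put() updates
    the precomputed aggregate, so reads are O(1) and no MR job runs."""

    def __init__(self):
        self.objects = {}
        self.count_by_type = {}  # the materialized "view"

    def put(self, key, doc):
        old = self.objects.get(key)
        if old is not None:
            self.count_by_type[old["type"]] -= 1  # retract old contribution
        self.objects[key] = doc
        self.count_by_type[doc["type"]] = self.count_by_type.get(doc["type"], 0) + 1

    def view(self, type_):
        return self.count_by_type.get(type_, 0)
```

This mirrors what CouchDB's incremental views buy you, just with the bookkeeping pushed into the application's write path.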

I think the eventual consistency is what narrows Riak's use case. I've only heard good things about it, though.

I see this a lot. Devs decide to stop using something because they don't like the interface it presents to developers, but fail to seriously (seriously, seriously) consider how its replacement will run in reality. That is, they like the idea of using something new, but are not prepared for the reality of actually using it. (In fairness to these guys, it seems more like the reality of their situation changed rather than just being short-sighted.)

So, I get the squicks now whenever I see someone talking about how lame and broken an old, mature technology is. The way this article shits on schemas, for example -- if a coworker said that to me in real life I'd get a sinking feeling in the pit of my stomach. It's a short hop from that type of thing into the land of the straight-up cowboy coder.

Being one of those idiots who went all in with MongoDB with our startup, I can relate. NoSQL should really be called NoDB. There will come a point where you ask, "Dude, where is my database?"

Riak looks interesting, but it's overkill. They recommend at least three nodes. We went back to PostgreSQL.

Shrug, I never bothered to say hello.
