
Goodbye, CouchDB - blanu
http://saucelabs.com/blog/index.php/2012/05/goodbye-couchdb/
======
JohnBooty
"No SQL. It’s 2012, and most queries are run from code rather than by a human
sitting at a console. Why are we still querying our databases by constructing
strings of code in a language most closely related to freaking COBOL, which
after being constructed have to be parsed for every single query? SQL in its
natural habitat"

COBOL? Really? I don't see the COBOL connection at all.

SQL is more closely related to relational algebra, so it makes absolute sense
when you're querying relational data.

So that's why many of us prefer to query our relational data in SQL. It's the
same reason why we write stylesheets in CSS instead of C++... and why we
validate postal codes and phone numbers with regular expressions instead of
"manually" parsing them with PHP, Ruby, Python, or what have you. It's all
about the right tool for the right job.

Relational data isn't the right solution for everything; there are lots of use
cases where NoSQL databases absolutely rock and traditional databases are
inappropriate. Again, right tool for the right job.

~~~
sah
The connection I see is the attempt to make the syntax English-like.
Expressions like "SELECT * FROM users" remind me of COBOL's "ADD X TO Y".

~~~
scott_s
Which is a completely superficial connection with no relationship to the
actual semantics.

~~~
AlisdairO
Very true. I'm tempted to argue that SQL's English-like syntax was a mistake,
but only because I think it makes people think you don't really need to learn
it. It's frustrating to see people trash something when they don't even
properly understand it - a lot of the database-related blog posts I've seen
make the front page of HN would be shot down if they were similarly
misinformed about a language like JavaScript.

I'm not by any means saying that the NoSQL movement as a whole is a fad -
there are good motivations behind some of these products - but a lot of
people who are just ignorant also seem to have hopped onto the bandwagon.

~~~
itaborai83
You made me think about something. A CoffeeScript-like approach to building a
saner language that sits atop SQL would definitely be worth having. Maybe
this could be the start of the "OnSQL" movement. Just my thoughts.

~~~
DennisP
Being able to parameterize table and column names in queries would be a big
help.
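
That gap is real: bind parameters in most drivers cover values only, never
identifiers. A minimal sketch of the usual whitelist workaround (shown with
Python's sqlite3; the table and column names are hypothetical):

```python
import sqlite3

# Placeholders (?) bind values, not table/column names, so identifiers
# are validated against a whitelist before being interpolated.
ALLOWED_TABLES = {"users", "jobs"}
ALLOWED_COLUMNS = {"id", "name", "status"}

def select_column(conn, table, column, row_id):
    if table not in ALLOWED_TABLES or column not in ALLOWED_COLUMNS:
        raise ValueError("unknown table or column")
    # Identifiers are interpolated only after validation; the value is
    # still passed as a bound parameter.
    sql = f'SELECT "{column}" FROM "{table}" WHERE id = ?'
    return conn.execute(sql, (row_id,)).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
print(select_column(conn, "users", "name", 1))  # ('alice',)
```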

~~~
itaborai83
It could have a sharding aware data definition language and some support for
"on-the-fly" data migrations.

------
johnbender
"No schemas. This was wonderful. What are schemas even for? They just make
things hard to change for no reason."

While they do mention the need to enforce constraints on your data, it's
comments like these that make me wish all application developers were required
to work as a DBA for a few months.

A properly normalized and "constrained" database prevents data loss from
stupid mistakes.

~~~
jmathes
'A properly normalized and "constrained" database prevents data loss from
stupid mistakes.'

A properly written application layer also prevents data loss from stupid
mistakes. A stupid mistake made while setting up a properly normalized
database also causes data loss.

You have to be very smart to be able to design a normalized constrained DB
well. The fact that only smart people can do it doesn't mean that people who
don't do it aren't smart.

~~~
autarch
"You have to be very smart to be able to design a normalized constrained DB
well."

But you can be a complete moron and write "[a] properly written application
layer [that] prevents data loss from stupid mistakes"?

What's the difference? Writing correct code can be hard. I don't think it's
particularly easier to apply all your constraints in app code unless you just
don't know about the database backend you're using.

~~~
lmm
It's easier to constrain your objects in the same language they're written in.
Say I have an object where my constraint is that either fielda is set, or
fieldb and fieldc are set, but not both (ignoring for the moment that that's a
stupid object to have). I can trivially enforce that in a constructor, but it
would take me quite a while to work out how to express that in SQL, if it's
even possible.

~~~
autarch
I think this reflects more on you than on SQL.

This is fairly trivial to express as a table-level constraint. I've done very
similar things in Postgres. I have no idea if you can do this in MySQL, but
it's quite crippled.
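
For the curious, lmm's either/or rule really is expressible as a table-level
CHECK constraint. A sketch in SQLite (Postgres accepts the same CHECK
clause); the table name is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Either fielda is set, or fieldb and fieldc are both set -- not both
# alternatives at once, and not neither.
conn.execute("""
    CREATE TABLE things (
        id INTEGER PRIMARY KEY,
        fielda TEXT,
        fieldb TEXT,
        fieldc TEXT,
        CHECK (
            (fielda IS NOT NULL AND fieldb IS NULL AND fieldc IS NULL)
            OR (fielda IS NULL AND fieldb IS NOT NULL AND fieldc IS NOT NULL)
        )
    )
""")
conn.execute("INSERT INTO things (fielda) VALUES ('a')")               # ok
conn.execute("INSERT INTO things (fieldb, fieldc) VALUES ('b', 'c')")  # ok
try:
    conn.execute("INSERT INTO things (fielda, fieldb) VALUES ('a', 'b')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)  # the constraint catches the bad row
```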

------
yxhuvud
What I really don't understand is why they chose MySQL over PostgreSQL -
Postgres already has hstore as a column type for storing schemaless data,
including support for indexes on the field.

~~~
trimbo
Except they didn't choose mysql -- they chose _Percona_. A bonus of going
with Percona is that in the process you get a fantastic company to back you
up.

~~~
moe
A "fantastic" company and a crappy database. Sounds like a questionable set of
priorities.

~~~
JoachimSchipper
I don't like MySQL either, but they're basically using it as a networked hash
table. It's not _so_ bad that it can't do _that_.

~~~
spudlyo
This is InnoDB's sweet spot, really -- a mostly read-only in-memory data set
where the lookups are done primarily by PK. MySQL 5.5 can scale this kind of
workload to 32 cores.

I'm pretty sure given this kind of workload MySQL will outperform PostgreSQL
handily.

~~~
jeltz
And PostgreSQL 9.2 will be able to scale this workload linearly to 64 cores.
So while MySQL may or may not win it will certainly not "outperform PostgreSQL
handily".

[http://rhaas.blogspot.se/2012/04/did-i-say-32-cores-how-abou...](http://rhaas.blogspot.se/2012/04/did-i-say-32-cores-how-about-64.html)

~~~
spudlyo
The key is that PostgreSQL 9.2 _will be_ able to handle a 64 core workload,
but current released versions of PG do not.

The fact is current versions of PG are unable to use more than 60% CPU on a 24
core machine. Do you know anyone who uses a dev version of an RDBMS in
production?

[http://archives.postgresql.org/message-id/BANLkTimVboKxzGS9B...](http://archives.postgresql.org/message-id/BANLkTimVboKxzGS9BhL7XphBCr4iy-s0BQ@mail.gmail.com)

~~~
jeltz
I believe that MySQL 5.5 scales to 32 cores but not linearly, while
PostgreSQL 9.1 caps at 24 cores. As I said in my last comment: I do not doubt
much that MySQL would beat PostgreSQL 9.1, but it won't beat it "handily".

------
DennisP
"constructing strings of code...which after being constructed have to be
parsed for every single query?"

I don't know about MySQL, but my database caches compiled queries.

"Things like SQL injection attacks simply should not exist."

They don't exist, if you don't construct SQL queries by concatenating strings
and variables.
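
To make the point concrete, a minimal sketch of the difference (using
Python's sqlite3; the table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

evil = "alice' OR '1'='1"

# Vulnerable: the input is spliced into the SQL text itself, so the
# attacker's OR clause gets parsed as part of the query.
unsafe = conn.execute(
    "SELECT * FROM users WHERE name = '" + evil + "'").fetchall()

# Safe: the driver binds the value separately; it is never parsed as SQL.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", (evil,)).fetchall()

print(unsafe)  # [('alice',)] -- the injected OR matched every row
print(safe)    # [] -- no user is literally named "alice' OR '1'='1"
```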

Meanwhile, all the cool kids are talking about getting rid of procedural code
in favor of declarative DSLs...

~~~
sah
_"They don't exist, if you don't construct SQL queries by concatenating
strings and variables."_

My point is, people still do this. You never hear about REST-injection or
memcached-injection attacks, even though those are possible in principle,
because those protocols don't encourage this mistake the way using SQL as a
database API does.

~~~
JoachimSchipper
Ahem. <http://news.ycombinator.com/item?id=3369876>.

------
fusiongyro
> What are schemas even for? They just make things hard to change for no
> reason.

This attitude right here is why the RDBMS old guard despises NoSQL. Willful
ignorance should not be celebrated.

~~~
sah
"If you don't like it, it must be because you haven't taken the time to
understand it" is cognitive poison. What evidence will convince you that
someone has understood well enough to judge that something doesn't make sense?

I've spent many years using schemas, and I know well how they work and what
they achieve. I'm saying they're a lousy tradeoff.

~~~
baconner
You may well understand what they are and have a well thought out nuanced
opinion, but the quote shows none of that. It sounds like an out of hand
dismissal of the whole concept of schema which would be pretty ignorant.

*edit I misspelled ignorant... irony alert

~~~
sah
The quoted statement was hyperbole. I followed it with as much nuance as I
felt it made sense to get into in the broader context of the article.

------
kanja
Why did you use mysql rather than postgres? It seems like most of your
complaints about mysql are solved in postgres (the query planner is much
stronger), and there are some features that seem like they would fit your
team much better.

~~~
sah
A number of us know MySQL fairly well, and in particular I've seen how it's
used by some of the biggest internet companies. We have some postgres
experience on our team as well, but it's a little more of an unknown. So
experience trumped feature set in this case.

One thing I would say about postgres is that it has a lot of features. As a
new user, it's hard to know which ones to use in which ways, and what the
downsides might be.

~~~
asnyder
Research?

~~~
Myrth
Limitations of our universe, such as time, brain capacity and sanity?

------
Roboprog
To summarize some of the other (upvoted) comments on this cringe-worthy
article:

Output to _any_ external system must be encoded to prevent fill-in-the-blank
injection if it uses a language rather than a string API-only approach. Use
prepared statements.

SQL is _not_ COBOL. Sets != ISAM.

You can store arbitrary data (XML, JSON) in BLOBS/CLOBS in an RDBMS.
Denormalization is frowned upon, but not forbidden.

PostgreSQL is arguably a better free / open DBMS than MySQL.

DBMS data constraints are a good thing; use them when they make sense.

~~~
Roboprog
that should have been "strong API" (not string)

------
Androsynth
I've been using CouchDB for the past six months for the internal product-CMS
for my business. It's got a great feature set for what I am using it for.
MVCC architecture, master-master replication, and stored views make it a
natural fit as a backend for internal tools. The benefits actually grow as
you get more designers/artists etc. working together, each on their own DB.

It seems like all the problems he has are with scaling. I can't comment on
that, but I would whole-heartedly recommend it for internal tooling.

------
josephkern
As a systems guy, I appreciate stories about developers learning that systems
are complicated, and the latest and greatest technology is often not as stable
or optimized as hoped.

Although, I do have high hopes for Key:Value store data repositories.

TANSTAAFL.

~~~
rdtsc
However as a developer guy I am glad they picked a technology that let them
make quick prototyping progress and started shipping a product.

Remember, only successful companies have to worry about scaling. Un(/not
yet)-successful companies have to worry about shipping first, then scaling.
So perhaps Saucelabs reached that milestone and simply outgrew the original
technology that helped them ship.

Who knows - maybe if they had spent time re-implementing a REST interface on
top of MySQL, or re-implementing Futon for debugging, we might never have
heard of Saucelabs at all.

~~~
MattRogish
+1 for op and gp. At a previous gig, we had the choice of using SQLite or a
raw JSON store for a mobile application we were building. Although SQL
would've been the "right" way to do it, the JSON implementation was basically
a no-op given our JS-based framework.

It did make a few things harder later on but we were better equipped to make a
change around the time we actually needed to. If we had spent a lot of time up
front developing for SQLite, we'd have paid an overhead tax for _years_ for no
good reason. And possibly wouldn't have made it to the point in which we
needed to make a change.

That said - our use case was incredibly narrow (read only data store, server
outputs JSON anyway, etc.) and so we made a reasoned choice.

If we needed complex queries/joins, updates/deletes, incremental loading, etc.
- then a JSON store would've been terrible.

------
wave
This is similar to what FriendFeed did in 2009 - storing schemaless data in
MySQL: <http://backchannel.org/blog/friendfeed-schemaless-mysql>

------
dpcx
So, the TL;DR version of this is: we stopped using CouchDB because it sucked,
but we took all the things that made it great and shoe-horned them into
MySQL, while forgoing everything an RDBMS is designed for, like joins...

------
balac
How do you go about searching the DB if everything is stored as a JSON object?
Are you using an index like solr/sphinx instead of doing any searches directly
in mysql?

~~~
sah
We keep things we might need to search in regular columns (typically with
indexes). The JSON object is just a way to add extra data to rows, which we
can fetch and deal with on the app side. That works fine for a lot of things,
and we're hoping it will give us some flexibility around when we need to do
schema migrations in some cases.
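
A rough sketch of that hybrid layout (illustrated with Python's sqlite3; the
table and column names are invented, not Sauce Labs' actual schema):

```python
import json
import sqlite3

# Searchable fields live in regular (indexable) columns; everything
# else rides along as an opaque JSON blob decoded on the app side.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        owner TEXT,      -- queried, so it gets a real column + index
        status TEXT,
        extra TEXT       -- JSON blob; never filtered on in SQL
    )
""")
conn.execute("CREATE INDEX jobs_owner ON jobs (owner)")

extra = {"browser": "firefox", "os": "linux", "tags": ["ci", "nightly"]}
conn.execute(
    "INSERT INTO jobs (owner, status, extra) VALUES (?, ?, ?)",
    ("alice", "passed", json.dumps(extra)),
)

row = conn.execute(
    "SELECT status, extra FROM jobs WHERE owner = ?", ("alice",)
).fetchone()
print(row[0], json.loads(row[1])["browser"])  # passed firefox
```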

------
xd
"guesses wrong about how to perform queries all the time. Experienced MySQL
users expect to write a lot of FORCE INDEX clauses."

This is generally a sign of bad indexing or query construction. I've seen so
many databases with dozens of indexes placed on tables (which only needed a
few) because the developers just didn't grasp how they should be set up -
which isn't rocket science.
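
For instance, one composite index matching the query's actual access pattern
often replaces a pile of single-column indexes. A sketch using SQLite's
EXPLAIN QUERY PLAN (the table is hypothetical; MySQL's EXPLAIN serves the
same purpose):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INT, created_at TEXT, total REAL)")
# One index covering the equality + range pattern the query uses:
conn.execute(
    "CREATE INDEX orders_cust_date ON orders (customer_id, created_at)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT total FROM orders
    WHERE customer_id = ? AND created_at >= ?
""", (42, "2012-01-01")).fetchall()
# The plan should mention orders_cust_date rather than a full scan.
print(plan)
```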

------
balloot
This article explains why I am using MySQL as the DB backend for a site I am
currently building. It also explains why I am not using Node.js. New
technologies are fun to play with, but they get decidedly less fun when your
site starts getting traffic and you realize that your new toy isn't ready for
prime time.

~~~
sandfox
It does nothing of the sort; node.js is ready for prime time - just have a
look at transloadit, voxer, yammer, etc. Don't confuse the tool not being
ready with you not being ready to use the tool. Equally, there are plenty of
people making CouchDB work for them.

~~~
balloot
None of those companies have a website as their main property as I do. The one
I think of when it comes to using node is Klout, and Klout's performance is
absolutely atrocious.

~~~
baudehlo
Voxer does 170m https hits/day and 2 billion http hits/day on node.

------
macspoofing
>We’re convinced that NoSQL is the future.

The future of what? Non-relational data? Relational databases are very good
for a wide variety of problems. And they will continue to be very good for a
wide variety of problems.

------
shadowmatter
I wish this article had some hard numbers for availability, performance, and
the size of their data as opposed to hand-waving.

Shameless plug: If you're looking to benchmark or load test CouchDB a bit, I
wrote one at <https://github.com/mgp/iron-cushion>. Hopefully someone out
there will use this to decide if CouchDB's performance meets their needs,
because migrating away from any database is painful...

------
ams6110
More a lesson about the pitfalls of building systems using technologies that
you don't understand very well than it is anything specific to either CouchDB
or MySQL.

~~~
abraham
More a lesson that requirements change over time.

------
StavrosK
If you want to run a NoSQL layer over a relational DB, check out something
like Goatfish:

<https://github.com/stochastic-technologies/goatfish>

Not production code by any stretch, but an interesting concept, and I'd be
more inclined to work on it if more people were using it.

------
NDizzle
Sounds like he has Lotus Domino-like problems in a product that closely
resembles Domino.

Yet on the same page he makes fun of SQL for being old and busted? I don't
get it.

~~~
sk5t
Notes' main problems are/were the lack of joins and most aggregates, lack of
proper transactions, lack of indexes that would permit useful runtime
queries, a somewhat slow data access layer, and a glacially evolving,
ugly-ass UI.

It is also so easy to build on that people who don't know what they're doing,
and shouldn't be building any software for redistribution, will do so anyway
and gain just enough success to become extremely annoying. Its IDE is about
on par with most other IBM-involved software development exercises, i.e.,
fairly bad.

Now, all that aside, it is a very powerful, but not generally well-
appreciated, system, and is great (fast, cheap, hard to kill) for a large
swath of applications that don't require any of the stuff from the first
paragraph.

------
vccarvalho
What I really don't get is not using mongo. It would be a natural fit - and
besides, isn't that "Don't use mongo" link you posted a known hoax?

~~~
encoderer
Meh. Single threaded map reduce and very hungry disk usage. Mongo isn't the
silver bullet a lot of supporters sometimes project it as (not saying you
are).

Also, he actually linked 2 articles. I'd never seen the first one, so it may
very well be a hoax or whatever. But the 2nd one has made its rounds a few
times and has stood up (IMO) to scrutiny.

------
puppybeard
tl;dr = "Turns out people who've been doing this longer than us actually know
what they're talking about"

------
nirvana
They're doing what works for them and good for them for that. But...I think a
LOT of people are really missing out by passing over Riak.

Many of the issues they found with CouchDB have been resolved with Riak. I
think the sync API for CouchDB is really cool, but Riak has the auto-sharding
thing down cold.

Riak runs map reduce queries across multiple nodes, so performance and
capability can grow as you add nodes.

CouchDB's views are neat but they impose some constraints that Riak's more
dynamic approach resolves (at the cost of possibly running more queries, but
these results can be cached easily giving Riak a form of "views" for often run
queries.)

I believe Riak's choices for backend are superior to CouchDB's. Further, Riak
supports multiple backends so you can choose the one appropriate for your
service (including InnoDB, LevelDB and Basho's Bitcask, as well as a super
secret hidden gem of a Caching RAM backend.)

Riak now has indexing of data, and queries on these indexes, but I can't
compare it to CouchDB. I can say that the feature is close enough for me to
not miss SQL.

I think Riak's "view performance" compared to CouchDB should be good, though
it may not compare to MySQL - but then, we're talking single-node
performance. Riak is distributed: if you need more performance, you just add
nodes and point them at the cluster. MySQL requires you to architect a (from
my perspective) brittle configuration of servers that can run into SPOF
issues.

For instance they talk about having a single write master. What happens when a
meteor crashes to earth and takes out that machine? Really unlikely, sure, but
I have had enough machines have failures (and failures are often really weird)
that I don't trust ANY machine to be a single point of failure. ... and when
I'm forced to, like being in a single datacenter or having a single network
switch, I don't like it, so I avoid it when I can.

Riak has automatic sharding and automatic rebalancing. It loses a node and
keeps running. You add nodes and it redistributes around. Riak is an
operational dream.

Not to bash CouchDB at all (or MySQL). I think CouchDB is a great product for
certain use cases.

I just think a LOT of people are really missing out by passing over Riak.

~~~
mpd
I investigated using Riak for dealing with our metrics a few months ago, but
with the data sizes we are dealing with, even the Riak people told us that
Hadoop was likely a better solution.

Once you are dealing with more than 500k keys or so, Riak starts to fall over.

EDIT: The 500k key limit pertains to mapreduce jobs, not the overall data
size.

~~~
aphyr
I feel somewhat responsible for this confusion, as the guy being quoted
here... :-(

Riak will handle billions of keys just fine. We had, I dunno, a half a billion
in a six node bitcask-backed cluster and were only at half capacity. Much much
bigger installs exist. The limit I was referring to is for a single mapreduce
job; Riak MR just isn't well-suited to operations over millions of keys at a
time. It _can_ do it, but Riak MR isn't really designed for bulk processing,
and I wouldn't be surprised to see MR become unusably slow over millions of
keys. You'll get better performance out of Hadoop, generally, for bulk
analytics.

The other tough point is key-listing. Listing buckets, listing keys, key
filters, MR over buckets, all those features are essentially useless in
production. Where the number of keys is large and unguessable it can become a
logistical nightmare to keep track of them. 2I key indexes can help, though.

~~~
ithkuil
I have a use case, which I don't know if it's common or not.

I want to put millions of items in riak, play with it, and then throw them
away.

I might want to do that because I'm testing out something, or because it's the
result of some periodic batch processing in production, which I want to get by
key later.

Unfortunately, riak doesn't seem to have the notion of a "db", "keyspace" or
whatever you want to call it; i.e. something which you could "drop" and that
would simply delete a directory with a dozen files in it (should be quite
cheap).

The only thing I can do is to drop the whole riak db, which has the following
problems:

1) I have to do it manually on all nodes (stopping the cluster, deleting the
files etc)

2) I cannot share a riak cluster between several users/teams so that each
user/team can play with a portion of it while there is only one central
installation of the whole cluster. Every application (for which I want to be
able to drop the whole db and recreate it) has to run its own riak cluster.

Initially I thought that "buckets" were intended to solve this problem, but
buckets don't map to a separate storage location; they're just a way to group
items. Even listing all buckets present in the db requires scanning all keys
and, as the doc says, "Similar to the list keys operation, this requires
traversing all keys stored in the cluster and should not be used in
production."

Although I've been told that "riak is not designed to do this and that", I'm
not sure if these limitations are really technical, or just because the
product development effort was targeted at some of the aspects, and these
issues could be addressed in a later stage.

Any idea?

~~~
aphyr
Tough call. If you did want to use Riak for fast bucket-drop, your best bet
might be to:

a.) Run multiple clusters--not too difficult. Just give each one a different
erlang cookie and run em on subsequent ports.

b.) Take bitcask_backend or leveldb_backend and add drop-bucket functionality.
Custom backends are more difficult than running multiple clusters, but
certainly not impossible. You could build it on top of fold or split writes up
into, say, one leveldb per bucket. Don't recall if the vnode interface _has_
drop-bucket so you might have to write some plumbing alongside Riak.
jrecursive has done this in Mecha.

If I were building something like this, I might look first at Cassandra or
Hbase, or possibly sharded master-slave postgres.

~~~
ithkuil
Thank you for your answer. The problems I see with (a), off the top of my
head, are:

1\. Even if it's easy, somebody has to do that.

2\. Setting up all the monitoring etc. for each instance.

3\. Running more than one riak daemon on the same machine means that each
riak daemon is unaware of the IO operations performed by the other, so IO
throughput could suffer. This means that in practice you would need to mount
separate disk heads (and we are back to 1).

4\. Each riak instance will require some RAM as well, so memory has to be
allocated and there is the risk that it's over-allocated.

5\. Port allocation. I fear it would end up with something like: "just keep
an internal wiki page where each 'db space' is mapped to a port number".

Well, the problem with (b) is of course that I don't have time to do that. For
now we stick to cassandra, but Riak is so nice in many aspects that I really
hope that at some point, as the product matures, more resources can be
invested in aspects which are not currently perceived as "selling points" for
riak, but are important for some scenarios and not technically impossible.

~~~
aphyr
Yeah, if you're using Cassandra and the GC/rebalancing issues aren't affecting
you, you're probably fine sticking with it. Both are Dynamo-structured, so
your consistency/failover model advantages are similar.

------
moron
I see this a lot. Devs decide to stop using something because they don't like
the interface it presents to developers, but fail to seriously (seriously,
seriously) consider how its replacement will run in reality. That is, they
like the idea of using something new, but are not prepared for the reality of
actually using it. (In fairness to these guys, it seems more like the reality
of their situation changed rather than just being short-sighted.)

So, I get the squicks now whenever I see someone talking about how lame and
broken an old, mature technology is. The way this article shits on schemas,
for example -- if a coworker said that to me in real life I'd get a sinking
feeling in the pit of my stomach. It's a short hop from that type of thing
into the land of the straight-up cowboy coder.

------
no-espam
Being one of those idiots who went all in with MongoDB with our startup, I can
relate. NoSQL should really be called NoDB. There will come a point where you
ask, "Dude, where is my database?"

Riak looks interesting, but it's overkill. They recommend at least three
nodes. We went back to PostgreSQL.

------
latchkey
Shrug, I never bothered to say hello.

