

The dark side of NoSQL - tjogin
http://codemonkeyism.com/dark-side-nosql/
There is a dark side to most of the current NoSQL databases. People rarely talk about it. They talk about performance, about how easy schemaless databases are to use. About nice APIs. They are mostly developers and not operation and system administrators. No-one asks those. But it’s there where rubber hits the road.
======
jlouis
I've been migrating data from a NoSQL database into PostgreSQL for the past
few weeks. Now data occupies about half the size on disk. And queries are up
by a factor of 100. Especially the advanced ones. Note however, that I have a
pretty good understanding of SQL, so I know how to optimize the queries.

One powerful aspect of SQL is that the database _can_ optimize the query. Some
NoSQL databases enforces you to write joins manually in code. Unless you
really know what you are doing and spend hours on it, the query will be slow.
This is exactly what I see with many of our queries. Relational databases is
not the answer to everything, but I would prefer to start with one rather than
a NoSQL variant. You can always migrate when things begin to get too slow (ie,
when you get mad success - which you probably never do anyway)

~~~
mahmud
_I've been migrating data from a NoSQL database_

That's like saying "I live in a capitalist nation and capitalism is not what
it's cracked up to be."

NoSQL is an ideology, and anyone you hear who uses that term is an ideologue.
Please name the "NoSQL" you have been using and name the specific problems you
have had with it so we can discuss matters more concretely.

Just as RDBMS users prefer one above all else, so do "NoSQL" users. I have
seen my fair share of them and I thought a good chunk of them where "crap"
(where crap means "does not fit my mental model of the solution"; mainly
because I am too naive to study the field better and appreciate what the tools
are capable of, so I dismiss anything that doesn't entirely model CL's
Elephant.)

Last week I rolled out my own Memcached "replacement" in Common Lisp and I was
spitting in bradfitz's general direction, that looney bloat monger! Yeah, my
"memcached" replacement is an in-process memoizer for all the DB accessing
functions. It worked until the server instance was overloaded and the load-
balancer forked a second instance: then I had a problem. I wish I could suck-
back all the spit I have sent in brad's general direction.

~~~
jbellis
> NoSQL is an ideology, and anyone you hear who uses that term is an
> ideologue.

You're sounding a little ideologue-ish yourself. :)

NoSQL is a term of limited use since there is such a huge variety of non-
relational database technologies, but that doesn't mean using it makes you an
ideologue any more than someone using the term "dynamic languages," which is
similarly broad.

~~~
blasdel
...and similarly naive / nondescriptive / misfocused

~~~
mahmud
btw, it's ironic that the author of the story was complaining about the lack
of querying language, when, in fact, "NoSQL" stands for "No Structured Query
Language" :-D

------
cturner
I've had an unusual problem with couchdb. A bad query can flatten your host.
The engine uses every bit of performance in pursuit of its answer, which might
be the workload equivalent of a series of stupid open joins in SQL. Your bash
consoles become unresponsive. The same mistake with a traditional relational
database causes less damage because those databases are less effective at
harvesting the power of the host.

There's a training obstacle. I know how SQL works and how to make SQL perform
well almost by instinct. NoSQL requires me to relearn data querying. The lack
of standardisation across offerings means that the knowledge I acquire will be
less transferable what I've accumulated from
Oracle/Postgresql/Openbase/Mysql/Sqlite.

~~~
Maro
You should think of CouchDB as the Web2.0 equivalent of Microsoft Access. It's
good for running a simple web-app off, because you can store the HTML, JS and
the data all in CouchDB (I think). So one use-case may be a _simple_ billing
system for your local surf shop that is sure as hell not going to store a lot
of data or handle a lot of load. I got this impression from a Damien Katz
talk. (I think the use-case they use is a recipe app.)

~~~
peregrine
Well with the load balancer, and continuous replication you could
theoretically scale to any level easily.

~~~
jbellis
No. CouchDB doesn't support partitioning, so all your replicas need to be able
to handle the full write volume.

~~~
boorad
It will soon. We have an 8-node cluster running in PROD right now. A Dynamo-
like layer provides consistent hashing, partitioning, read/write quorum
constants and the like.

------
bscofield
One of the complaints that people have about the NoSQL label is that it
attempts collect a huge variety of databases that don't fit the traditional
label under one name. The problem with that is illustrated in this article,
which does a great job of describing problems that some of the NoSQL
alternatives experience – but the problems here don't apply nearly as strongly
to document-oriented databases, or to graph databases. I think the title's
overstating the argument, in other words.

Now if it were "the dark side of column-oriented databases," Bob's your uncle.

------
antirez
This are real problems. With Redis I tried to mitigate some of this problems
by providing an API that allows to retrieve the full state of the database
just using available commands (without having to mess with the binary dump)
including for instance reading the TTL of an expiring key. This allows to
create scripts that are able to convert a full database into a CSV file, and
restore back the CSV file into the DB, just using the net API.

Of course with key/value stores changing the structure of the database is
harder compared to SQL databases, still designing a good data layout when
building your application can help a lot to make it simple to add features to
the application without to resort to scripts that rebuild the data in the
key/value store.

~~~
leej
i agree with data export point. @antirez if we could clone you and distribute
to other KV projects then all KV datastores skyrocket in popularity. first and
foremost issue in KV popularity is ability to try quickly. if there would be a
small php script that converts an sql dump into a KV datastore (and vice
versa) then much more people will be trying and they know that if it wouldnt
work out they can easily opt out. i hope you'll continue to be thoughtful for
others who are not C++ or Java wizards. thanks!

------
Maro
I'm one of the lead programmers of Keyspace, which is a key-value store. The
main strength of Keyspace is the consistent replication, but right now I'll
tell you what my experiences are building a web app (real-time search) using
Keyspace. In other words, what happened when I drank my own kool-aid.

1\. Schemaless. This is a half-truth. In reality there is always some kind of
schema, otherwise you'd get chaos. What works for us in Keyspace, where
everything is a string, is to use "/" as a separator convention. So we have
keys like: "/user/id:001" => <data>

More complicated example:

"/user/id:001/view/keyword:keyspace/"

2\. Listing. Since listing happens on a prefix basis in Keyspace (give me
everything that starts with "/user/id:001"), if you just want to get a list of
some properties of the user, you have to create the list explicitly, otherwise
you'll get all the keys under the user. Eg.

"/user/id:001/views/list/keyword:scalien" => null
"/user/id:001/views/list/keyword:keyspace" => null

3\. If you need to list by property X, you need to create a subtree for it, as
above. This sucks if you figure this out at a later point, because you have to
programatically parse your existing data and create it, besides adding it to
the online code for new items. That's why RDBMSs are so good.

4\. Backups. No good backup strategy at this point, hot or cold.

..

So why should anyone put up with this? You have to realize there's lots of
hype right now, in reality KV stores should only be used by a small fraction
of DB users. If an RDBMS is Java/C++, then a KV store is (can be) Assembly.
You have much more control over what's going on. (You can of course get the
same control by having a giant key-value table in your RDBMS.) You can put a
cache layer in your stack more easily.

So you should think of a good KV store as something that you use under certain
hi-performance cases, and even then, the KV store should offer some plus over
having a giant key-value table in My/Postresql. Usually, that plus is some
kind of distribution or replication that usually works. In case of Keyspace
it's replication.

The other big plus is, as I learned, is that programming in the KV model is
_much much more_ natural then having embedded SQL statements in your code. A
typical segment in my (Python) code is:

users = conn.listkeys("/user/id:%s/views/" % user_id) for (...):

This is much nicer then having SQL statements, and since you're using a KV
store, it can't get more complicated than this. As I said, I assume you're
doing hi-performance stuff, in which case some crappy ORM layer is out of the
question anyway.

~~~
msluyter
Have you made, or you aware of, any comparisons between a KV store such as
this and the same thing implemented as an IOT in Oracle?

~~~
Maro
No and no.

Here's a link about IOT for others:

[http://www.oracle.com/technology/products/oracle9i/datasheet...](http://www.oracle.com/technology/products/oracle9i/datasheets/iots/iot_ds.html)

Personally I think there's a fairly large cultural disconnect between shops
running open-source stacks and shops running Oracle, so such a comparison
would not influence user choices much. But it would be very interesting for an
implementor like me: if Oracle is faster, which is likely at this point, it
would set a performance point to beat.

As a final fun fact, Keyspace is currently running on top of BerkeleyDB, which
is an Oracle property =) But not for much longer, I'm very unsatisfied with
BDB, it doesn't live up to its image of being an industrial strength btree
store (it sucks).

~~~
gcv
What do you dislike about Berkeley DB? I've been using it for a little while
for a pet project, and it seems decent. I'd like it to support range-based
join cursors, but that's not a major complaint. Granted, I haven't yet had to
deal with database recovery, so I don't know how it stacks up there.

Also --- what do you plan to replace it with? I know that AllegroCache started
out based on BDB, and ended up not using it. The Franz guys said performance
wasn't up to par, and I always wished for more details.

------
jrockway
"NoSQL" is a completely meaningless expression. Please stop using it.

~~~
thenduks
It's just a marketing/catchy term for the trend to 'next-generation' (read:
'different') database engines. Since there is (in almost all cases) no SQL
(the language) involved, it is a pretty appropriate name.

~~~
jrockway
There's a reason that Java is not called "NoPHP". While technically true, it's
just not very descriptive or meaningful.

~~~
jbellis
A better analogy IMO is Python, Ruby et al calling themselves "dynamic
languages" to distinguish themselves from the popular static languages of the
time.

~~~
scott_s
But they called themselves dynamic, not "NoC++" or "NoJava". That is, it was
described by its intrinsic properties instead of as not "a specific instance
of a something that implements what they don't like."

SQL is a way to interact with a relational database, but it is not itself a
database. But "NoRDBMS" still isn't as good as saying what the technology is -
there are an infinite number of things that are not relational databases.

------
tjogin
From what I can tell, it seems the "dark side" mentioned only really affects
Voldemort and Cassandra, out of all the databases that fall into the "NoSQL"
category?

~~~
jbellis
Ad-hoc querying affects all distributed databases, which includes dynomite,
hbase, and hypertable as well as voldemort and cassandra. Also all key-
oriented databases, which include Tokyo Tyrant and redis. And some others like
CouchDB (where computing a new view via map-reduce is basically equivalent to
doing a query via hadoop against another system, so I don't consider that real
ad-hoc query support).

Really the list of databases that do provide ad-hoc queries is much shorter.
Neo4j (graph db), mongodb (single-master "document-oriented" db), the xml
databases, probably some other minor projects.

TFA also mentions data export as a "dark side" but for projects that lack this
it's really just a "nobody's needed it enough to write it yet" thing. So
that's less interesting IMO than talking about ad hoc queries which is usually
more of an architectural limitation.

~~~
vicaya
Well, ad-hoc querying _can_ be implemented in Hypertable with HQL, it's just
not a priority at the moment, so features are a bit lacking. I mean, it's not
an inherent 'darkside' of distributed database. OTOH, I've done cap shell with
HQL and grep/sort/whatnot a few times on a few TB of data and not exactly
painful :)

------
richcollins
These have nothing to do with the concept, just the current state of the
tools. There is no reason that a non-relational DB without a declarative query
language can't have good support for data dumps, ad hoc queries and ad hoc
"data fixing".

------
rgrieselhuber
Backups are a particularly serious issue for distributed variants.

~~~
jbellis
Do you mean because they offer poor support, or because there's usually large
volumes of data involved, or both?

Cassandra, for instance, provides an easy snapshot API, and you can back up
the snapshots using whatever infrastructure you'd normally use to back up TB
of data. (If you don't have such an infrastructure, yeah, that's a problem,
but not Cassandra's fault. :)

Edit: I should point out that since Cassandra supports multiple data center
replication already, I'm having trouble picturing a scenario where you want to
do anything w/ the snapshots besides just leave them on the Cassandra nodes
themselves (plan extra HDD capacity as necessary depending on how long you
want to keep them). But some such scenarios probably do exist.

~~~
vicaya
Replication != backup, if you/someone accidentally/maliciously screw up your
data with the API, the screw ups are replicated as well, especially with
Cassandra, where there is no builtin data versioning.

Backups are essential when you're doing major upgrades or data
migrations/mangling that can fail.

OTOH, data export is not an inherent problem of distributed database either.
If enough people want the feature, they can be built without too much fuss.

~~~
jbellis
> Replication != backup

Obviously. The context was using snapshots as backups -- you can keep them
around indefinitely, space permitting, and if you're using Cassandra's multi-
datacenter features they're automatically "remote" as well.

> especially with Cassandra, where there is no builtin data versioning

At the risk of belaboring the obvious, versioning != backup, either. :)

------
jbl
It seemed to me like the article was more about particular "NoSQL" databases
than it was about the genre as a whole. In other words, I don't think the
author's pain is endemic to this admittedly broad class of data stores, just
the few that do exist.

~~~
antirez
The "missing alter table" problem affects more or less any kind of schema-free
DB.

~~~
zaphar
How so. In most schema-free db's the missing alter table feature is not
needed. maybe you meant the quotes to be in a different place? missing "alter
table problem" perhaps?

~~~
vicaya
With completely schema-free db, you have to encode 'column' in a key, which
means you cannot delete or rename a column easily (in O(1) time).

~~~
evgen
You won't get O(1) since scheme-free dbs are not organized around
columns/fields in the same way a traditional db is all about rows and columns,
but just about every one I am aware of let's you create an index on a
particular field so at least you get O(n). Given the infrequent use of this
operation I am happy to leave it in the "O(n) is good enough" category in
return for the other benefits an efficiencies that a schema-free db offer.

~~~
vicaya
Well, you can get both with Hypertable, where you can alter table to add/drop
column families in O(1) time, while still have schema free benefits with
column qualifiers.

------
berntb
Hmm... there are different query APIs to all the NoSQL variants?

What stops you from doing a one-liner from shell (or a short throw-away
program) instead of a SQL query/update? (Well, except for the power of SQL
compared to the APIs, of course.)

Disclaimer: I don't know enough about these newfangled things and their
tradeoffs to have a serious opinion. They do look intriguing. :-)

