

I Can't Wait for NoSQL to Die - jbyers
http://teddziuba.com/2010/03/i-cant-wait-for-nosql-to-die.html

======
rufo
_NoSQL will never die, but it will eventually get marginalized, like how Rails
was marginalized by NoSQL._

Is it just me, or does this statement make absolutely no sense whatsoever?

~~~
alextgordon
I don't know if this is what he was trying to say, but it makes a little more
sense if you change it to

 _NoSQL's hype will never die, but it will eventually get marginalized, like
how Rails' hype was marginalized by NoSQL's hype._

------
strlen
AdWords implemented on top of MySQL? Perhaps the CRM portion of AdWords (i.e.,
where the advertisers submit their ads and publishers view their balances) is
-- it's fairly easy to partition by functionality and doesn't have extremely
tight latency bounds. This isn't where real time auctions (what really
distinguishes AdWords from what came before) happen.

You can be _sure_, however, that the data used for real time ad auctions is
extracted out of MySQL and into a _highly_ customized data store (likely a
pure in-memory one). It's all about using the right tool for the job. You can
also be sure that you'll never see a paper on that data store, as that's their
competitive edge. If you could duplicate it with off the shelf components
(whether MySQL _or_ Cassandra), Google would be toast.

Likewise, I am sure Amazon uses Oracle for their billing system and catalog
submission interface, but they use specialized systems for search, shopping
cart and recommendations.

For a business app that only needs to scale to the number of _paying_
customers (i.e., advertisers, account managers and customer support) and has
no real time constraints -- but on the other hand involves complex and
frequently changing business logic (e.g., where altering tables may be
required) -- an RDBMS is the right tool for the job.

Where latency matters, data grows much faster than Moore's law (relative to
main memory size), Amdahl's law starts to matter for computation (the
workload needs to be partitioned to take advantage of parallelism), and
traditional caching strategies simply don't work, something else is needed.
That situation is becoming more and more common across web companies. You can
also be sure that places like Walmart employ plenty of non-relational
technologies (my personal bet is that they're likely using Coherence or
Terracotta): usually, however, they're
expensive and are built/configured by field-engineers to be custom tailored
for their workloads. When you employ a world-class engineering team, "build"
starts to make more sense than buy when you're solving a very specific and
constrained problem (e.g., fault tolerant shopping cart system).

You don't need to be Google's size to be at that stage. Talking about
scalability and performance without taking the workloads into account (e.g.,
"Google, Facebook or Amazon" as if e-commerce, search and social networking
were comparable) is also an anti-pattern: I am sure engineers at Google would
laugh when you compare Facebook's scale to theirs; likewise Facebook's
engineers would laugh when you compare the real time aggregation that happens
on their site to what happens at Amazon; Amazon's engineers would likely tell
you holiday season pager duty horror stories that would scare Facebook _or_
Google engineers.

~~~
derefr
> a highly customized data store (likely, a pure in memory one)

That's when we stop calling it a data _store_ , and start calling it a data
_structure_. Data stores are where data goes when it's not part of the working
set. With that definition, it's perfectly sensible for AdWords to use MySQL as
its data store.

~~~
strlen
(Edit: this is a longer reply than I intended, no longer really intended as a
direct reply to the parent; this is more a reflection on systems architecture
of data-intensive applications).

That's a good point, but a pure in memory data structure is:

a) Not persistent to disk _at all_. Judging from my own experience with
similar low-latency systems used in ad serving (where we called these "data
servers") and other similar systems, the data is likely to be persisted to
local disk and the deltas replayed to it from a MySQL db to avoid long restart
times.

b) Lives within the ad server process. This is likely not true, as the ad
server process will need to compose a "working set" for particular ad auctions
from multiple data sources (bid price for each ad, keywords,
budget/delivery/campaign specifications for each ad, keyword relevance of
ad/ad campaign). Each of these data sources is likely represented by a
different data structure (red-black tree for one, hash table for another, trie
for yet another, graphs, B-Trees, etc...), has very different characteristics
in terms of cache-locality, rate of change, size, density and comes from
multiple places (some from RDBMS, others from Map/Reduce)
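
As a rough illustration of the composition described above (every name and
structure here is invented for the sketch, not anything Google actually
runs), an auction-time "working set" might be joined from several
independently maintained in-memory structures:

```python
import bisect

# Hypothetical ad "data server": compose a per-auction working set from
# several structures, each maintained and refreshed independently.

# Hash table: ad id -> bid price (fast point lookups, changes frequently).
bids = {"ad1": 0.45, "ad2": 1.20, "ad3": 0.80}

# Sorted keyword list (a stand-in for a trie or tree): keyword -> ad ids.
keyword_index = [("cars", ["ad2"]), ("shoes", ["ad1", "ad3"])]
keywords = [k for k, _ in keyword_index]

# Budget data, refreshed from a separate (e.g., batch) pipeline.
budgets = {"ad1": 10.0, "ad2": 0.0, "ad3": 5.0}

def working_set(query):
    """Join the structures for one auction: eligible ads with their bids."""
    i = bisect.bisect_left(keywords, query)
    if i == len(keywords) or keywords[i] != query:
        return []
    ads = keyword_index[i][1]
    # Drop ads whose budget is exhausted, attach the current bid.
    return [(ad, bids[ad]) for ad in ads if budgets[ad] > 0]

print(working_set("shoes"))  # [('ad1', 0.45), ('ad3', 0.8)]
```

The point of the sketch is that no single store holds "the data": each
structure is shaped for its own access pattern and refresh rate.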

(Interesting side note: earlier I also wanted to say that these data
structures are usually neither partitioned in how they're stored, nor is the
computation done on them partitioned. However, in the age of parallel
computing this is simply not true: there are now parallel data structures and
algorithms.)

One compromise is perhaps we can call these systems "data servers" or "data
structure servers" (afaik Redis does the latter). MySQL (or any other RDBMS)
merely _feeds_ these systems through some form of message oriented middleware.
In this case the RDBMS (and this is an oversimplification which doesn't cover
all the corner cases) is merely acting as tape: changes are played forward,
not randomly accessed. The RDBMS that is the source of truth for ad-serving
is _never_ queried in real time and can easily be taken down for maintenance
while ad serving continues. It doesn't even need to be highly available (if advertisers
can't submit ads it would certainly be a _huge_ and costly outage, but much
less costly than if users see ads!).
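
A minimal sketch of this "RDBMS as tape" idea, with invented field names: the
serving store only replays an ordered change feed forward, and the sequence
number it has reached is exactly the point at which it is consistent with the
upstream database (the basis for the SLA mentioned below):

```python
# An ordered change feed, as a CDC/message-queue bridge from the RDBMS
# might deliver it. The serving side never queries the database directly.
changelog = [
    {"seq": 1, "op": "upsert", "ad": "ad1", "bid": 0.45},
    {"seq": 2, "op": "upsert", "ad": "ad2", "bid": 1.20},
    {"seq": 3, "op": "delete", "ad": "ad1"},
]

class ServingStore:
    """In-memory serving store; `applied_seq` marks how far it has
    caught up with the source-of-truth database."""
    def __init__(self):
        self.ads = {}
        self.applied_seq = 0

    def replay(self, entries):
        for e in entries:
            if e["seq"] <= self.applied_seq:
                continue  # already applied; replay is idempotent
            if e["op"] == "upsert":
                self.ads[e["ad"]] = e["bid"]
            elif e["op"] == "delete":
                self.ads.pop(e["ad"], None)
            self.applied_seq = e["seq"]

store = ServingStore()
store.replay(changelog)
print(store.ads, store.applied_seq)  # {'ad2': 1.2} 3
```

If the database goes down for maintenance, the store simply stops advancing
`applied_seq` and keeps serving its last-known state.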

Note, such a system is also necessarily eventually consistent (in the truest
meaning of the word: customer receives an SLA which corresponds with a point
where the serving component is consistent with the DBMS).

There still needs to be an efficient OLAP component to back the CRM/ERP
functionality of this system, for which an RDBMS is still a good bet (combined
with an off-line system e.g., Map/Reduce for more complex reporting and
optimization). However, if an end-to-end ad-serving system were written from
scratch _now_, would the RDBMS component serve as the primary source of truth
(rather than _just_ as the backend for the publisher/advertiser/support UI
component)?

In addition, this ("write to RDBMS, serve from elsewhere") design is very
specific: writes to the "ad submission database" are rare and don't always
require high availability. Consistency (between the RDBMS and serving
component) can be _much more_ eventual than would be in a Dynamo based system
(where the weak "can't read my writes" eventual consistency is only a failure
condition).

Now suppose you _also_ want highly available, low-latency writes (even if not
at the same frequency as reads) and you'd want to be able to read-your-writes
in normal situations. This makes the "write to RDBMS, serve from something
else" (effectively what popular memcache+MySQL deployments are) scenario more
brittle. You now have much harder questions to answer (do I want a system
that's always in a consistent state e.g., to avoid having to do quorum
reads/writes? am I okay with eventual consistency as a failure scenario?
etc...) but with many workloads this becomes a necessity.

Despite speaking at NoSQL events, I am not a big fan of the NoSQL name. Not
only do these systems _not_ intend to completely displace SQL-based RDBMSs
(as the ad server example shows, they can exist side-by-side with them);
these systems also provide functionality that _can't_ be provided by an RDBMS
(and _not_ just due to scalability concerns).

------
teilo
There is a rash of these articles popping up. While there are usually valid
points in urging people to avoid the hype of the "next big thing", a lot of
these guys seem to be bitching that they might have to learn a new skill set.

I have been in this business long enough to remember Clipper and FoxPro devs
bitching about SQL when it was on the rise. This sounds about the same to me.

~~~
mpk
These articles are just link-bait with no content. The 'NoSQL' buzzword is as
distasteful to me as, say, 'cloud', but that doesn't mean that articles
discussing them don't have value.

The flurry of 'NoSQL' articles often cover the different approaches to data
stores, their implementation, their interfaces, their management, performance,
scalability, etc. That interests me, but doesn't mean I'm going to go to work
on Monday and kill all the non-NoSQL dbs we have running.

Hate-against-hype articles are waaaaay overrated.

------
paulgb
Ted's point may be valid for BigTable-like databases. (I'm not saying it is,
but I don't know enough about those to say so.) Those are designed for
scalability and if you don't need the scalability you probably _should_ use a
RDBMS instead.

But there are other advantages of SQL-less databases that don't deal with
scale. I deployed my first MongoDB app a couple of weeks ago. Even though it
was a small (~1 developer-month) project, and neither I nor the other
developer had used MongoDB before, I still think we finished faster than if
we'd used MySQL. Just like Cassandra is a premature optimization if you just
need an RDBMS, an RDBMS is a premature optimization if you only need an object
store.

~~~
japherwocky
I got downvoted awfully last night for trying to say this, but I'll say it
again to back you up:

Developing with mongodb is _lightning fast_ and holds up very very well. If
you end up with problems, switch to SQL later!

I don't think a lot of these people hating on the nosql projects have actually
tried building something with them.

~~~
richcollins
Why is it faster?

~~~
paulgb
It doesn't force you to fit non-tabular (e.g. hierarchical) data into a table
structure.

Also, no schema means the data structure is more malleable. With the right
ORM, this fits in nicely with polymorphism: I can store objects with some
common features in the same collection, but when I retrieve them from the
database I get different types of objects which inherit from the same base
object. mongoengine is one ORM that does this.
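
For readers unfamiliar with the pattern, here is the idea sketched in plain
Python, with no database and invented class names (mongoengine itself does
this by storing a class marker, a `_cls` field, in each document):

```python
# Sketch of polymorphic retrieval from a single schemaless collection.
# A list of dicts stands in for the MongoDB collection.

class Content:
    registry = {}

    def __init_subclass__(cls):
        # Remember every subclass by name so documents can be rebuilt.
        Content.registry[cls.__name__] = cls

    def __init__(self, title):
        self.title = title

    def to_doc(self):
        # Each stored document carries a class marker alongside its data.
        return {"_cls": type(self).__name__, "title": self.title}

    @staticmethod
    def from_doc(doc):
        # Dispatch on the marker to reconstruct the right subclass.
        return Content.registry[doc["_cls"]](doc["title"])

class Post(Content): pass
class Page(Content): pass

collection = [Post("hello").to_doc(), Page("about").to_doc()]
objects = [Content.from_doc(d) for d in collection]
print([type(o).__name__ for o in objects])  # ['Post', 'Page']
```

One query over the shared collection comes back as a mix of `Post` and
`Page` instances, which is the convenience being described above.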

~~~
heresy
Nice. But I have a difficult time seeing how storing "malleable" data like
that, opaque to the storage engine, is going to be performant for querying.

Must be nice to have requirements that never change once you've decided on a
data representation...

~~~
paulgb
The database is still aware of the fields, so MongoDB can build indices on
certain fields if you wish. Admittedly I haven't deployed Mongo in an
environment that really tested its performance, but we've been serving about
20k pageviews per day with no issues. Granted, this was a fairly basic
application.
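
To make that point concrete, here is a toy stand-in for such an index (the
real thing being what MongoDB builds when you ask it to index a field): even
with no schema, any field the documents happen to carry can be indexed, and
documents lacking the field simply don't appear in it.

```python
from collections import defaultdict

# Schemaless documents: not every document has every field.
docs = [
    {"_id": 1, "type": "post", "author": "ann"},
    {"_id": 2, "type": "page"},               # no "author" field at all
    {"_id": 3, "type": "post", "author": "bob"},
]

# Toy secondary index on "author": field value -> matching _ids.
index = defaultdict(list)
for d in docs:
    if "author" in d:
        index[d["author"]].append(d["_id"])

print(dict(index))  # {'ann': [1], 'bob': [3]}
```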

As for changing requirements, mongo handled those well too.

It's certainly not a silver bullet, but when I just need a basic object store
the query performance trade-off is worth it.

------
pradocchia
It occurs to me that MySQL started off as a thin SQL wrapper on a NoSQL
database: here, have a SELECT and WHERE, but you'd best not JOIN, and forget
about transactions or referential integrity.

Then, over time, they tacked on a few more relational features, but they had
yet to solve the hard problems of relational databases.

Meanwhile, the people who were originally drawn to MySQL as a dumb-and-quick
datastore got frustrated with this line of development and christened the
NoSQL movement. It's not so much a departure from relational databases (they
were never really there), but a return to _MySQL_ basics, w/out the SQL.

------
rythie
"Did you know that Cassandra requires a restart when you change the column
family definition? Yeah, the MySQL developers actually had to think out how
ALTER TABLE works, but according to Cassandra, that's a hard problem that has
very little business value. Right."

Really? Did the MySQL people think about it? Because it takes ages to do an
ALTER. Even when you are doing something like _dropping_ an index, it can lock
up for hours, during which no one can do any inserts. In contrast, restarting
a service is no big deal.

~~~
newhouseb
> In contrast restarting a service is no big deal.

Yikes, restarting a service that's so essential to everything else in web
infrastructure is certainly a big deal. Where I work (large 30+ million
users/month site), we have batches that do all sorts of processing and DB
crashes (basically equivalent to a restart) can be a major pain because it can
be difficult to figure out exactly what failed and when and how to best
recover. You might say, "oh just re-schedule all the batches to allow
downtime", but once you have 30 developers with a hundred or so batches, that
can be damn near impossible to orchestrate.

A simple stateless web-app can probably tolerate a DB restart, but Cassandra
was built to scale - not to host a to-do app.

ALTER takes ages to do because of the ACID constraints of MySQL. If you want,
you can sacrifice the ACID constraints by just cloning the table with the
proper modifications and then dropping the table, but I'm venturing into DBA-
land for which I am in no way qualified to profess knowledge.
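
The clone-and-swap pattern mentioned above, sketched with SQLite so it is
self-contained (on MySQL, third-party tooling automates this properly,
including capturing writes that arrive during the copy, which this naive
version does not):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO users (name) VALUES ('ann'), ('bob')")

# 1. Create the modified copy (here: an added column with a default),
#    instead of running ALTER TABLE on the live table.
con.execute(
    "CREATE TABLE users_new "
    "(id INTEGER PRIMARY KEY, name TEXT, plan TEXT DEFAULT 'free')"
)
# 2. Backfill it from the old table; the new column takes its default.
con.execute("INSERT INTO users_new (id, name) SELECT id, name FROM users")
# 3. Swap: drop the old table and rename the copy into place.
con.execute("DROP TABLE users")
con.execute("ALTER TABLE users_new RENAME TO users")

rows = con.execute("SELECT name, plan FROM users ORDER BY id").fetchall()
print(rows)  # [('ann', 'free'), ('bob', 'free')]
```

The trade-off is exactly the one named above: writers aren't blocked for the
duration of a long ALTER, but any writes during the copy need separate
handling or they are lost.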

~~~
pradocchia
> ALTER takes ages to do because of the ACID constraints of MySQL.

No, I don't think it has much to do with ACID. Rather, they made a single,
simple implementation of ALTER TABLE that copies the whole table on any
change whatsoever. Add a column? Recreate the table. Add a table comment?
Recreate the table. Drop an index? Recreate the table.

They _could_ have identified cases where in principle the table & metadata
could be modified in-place, but that would be a lot harder than a simple copy.
It would probably necessitate changes to the legacy architecture, which in
turn would require a host of other changes.

------
mark_l_watson
I can't believe some of what is said by people on both sides of the NoSQL
arguments. Discounting use of RDF data stores, almost all of my recent work
involves PostgreSQL and MongoDB. I think that it is blatantly obvious which to
use in specific circumstances. I have not had to do this yet, but using
Datamapper.setup, you can integrate the use of both in the same application by
storing some model data in a relational database and some in MongoDB, as it
makes sense to do so.

~~~
mark_l_watson
Here is an article on mixing Datamapper + MongoDB + MySQL:
[http://lunarlogicpolska.com/blog/2010/02/15/mysql-and-mongod...](http://lunarlogicpolska.com/blog/2010/02/15/mysql-and-mongodb-working-together-in-kanbanery)

------
jbyers
As is to be expected from this author, this is definitely on the flame-bait
side of things. I submit it because I believe there is an important point
here: for the vast majority of startups, going with a relatively unproven
"NoSQL" database is a premature optimization and an unneeded technical risk. I
disagree with the author that these databases are a flash in the pan, but
their over-application is.

~~~
psadauskas
I dunno, I see the opposite: RDBMS's are a premature optimization. In my
experience, it's /much/ easier to hack together a quick webapp in MongoDB,
because you don't have to worry about relations, migrating schema, etc. Sure,
it might be slower than Postgres on a billion-row table, but wait until you
have a million rows before you shackle yourself to the relational constraints.

~~~
bilbo0s
Absolutely right.

He has it backwards. You use NoSQL to 'get shit done'. When you have a billion
rows, then worry about schemas. By that point you will have a much better
idea, a - what said schema should look like, b - what the architecture of the
Postgres, or mysql, or Oracle should look like, and c - how much money you
will have to solve the problem.

~~~
prodigal_erik
I once worked in a Notes shop. Notes has no schema for documents and nothing
to enforce migrating data from older documents to the current format. After a
few years of customers manipulating data with various versions of the code,
they had documents in such bizarre combinations of states that it was no
longer possible for anyone on our dev team to inspect them and say which
behavior would be right for the workflow.

Schemaless data should only be a summary of data properly maintained
elsewhere, which you can regenerate at need. If your _authoritative_ data has
no schema, it will decay to garbage.

~~~
bilbo0s
That's because you waited years to address the problem. Not only that, you
also rewrote the code, as I advise. But you did not take the opportunity to
address the structural data issues you were having, contrary to my advice.

My strategy is to rewrite the code, if needed, but with an eye towards
addressing structural data issues. After a few months use of a web app you
have a good idea of any surprising usage patterns that may appear. Readjust at
that point when you are 'talking with data'.

This advice is for small startups of the HN variety, where 'customers' are a
lot more important than 'authoritative' data stores initially. NoSQL systems
are useful tools for mitigating the danger of doing too much engineering
upfront. Many tech entrepreneurs fall victim to doing too much upfront
engineering in the hopes of their data store not 'decaying to garbage', only
to find that no one wants to use their product. NoSQL makes it easy to go back
and migrate off the data you want to store 'on the move'. When you have a
better idea of how much of it there is, and how it is used.

~~~
prodigal_erik
If you are not doing data migration, after n revisions to the data management
code, each record can be in any of 2^n states, depending on which code
revisions did or did not modify it. How many revisions can you make before
your code can no longer handle some of your older data? I'd say days' worth,
not months, because you're trying to iterate a lot faster than we did. And the
odds of a complete rewrite understanding all your old data are even worse.

If you are doing data migration, you necessarily have an old and new schema in
your head. At that point you're just refusing to write it down and let the
tools tell you whether the code agrees with you.
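
The 2^n argument can be made concrete: with an explicit version field and
ordered migrations, every old record converges to one known shape instead of
lingering in whichever state the last code revision left it (all field names
below are invented for illustration):

```python
def rename_username(r):   # rev 1 -> 2: `username` became `name`
    r = dict(r)
    r["name"] = r.pop("username")
    r["version"] = 2
    return r

def add_tags(r):          # rev 2 -> 3: a `tags` field was introduced
    r = dict(r)
    r.setdefault("tags", [])
    r["version"] = 3
    return r

migrations = {1: rename_username, 2: add_tags}

def migrate(record):
    """Apply every outstanding migration in order; records with no
    version field are treated as revision 1."""
    while record.get("version", 1) in migrations:
        record = migrations[record.get("version", 1)](record)
    return record

old_records = [
    {"username": "ann"},                    # written by revision 1 code
    {"name": "bob", "version": 2},          # written by revision 2 code
    {"name": "cal", "tags": ["x"], "version": 3},
]
migrated = [migrate(r) for r in old_records]
print(migrated)
```

After migration every record is at version 3, so the current code only ever
has to understand one shape, which is the discipline being argued for.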

------
derefr
No one has ever explained this to me: why are we partitioning this space? Why
can't a single database management system:

* have individual tables, indices and views that are _either_ relational _or_ document-oriented, _or_ graph- or object-based while we're at it, on a case-by-case basis,

* manage them all in a single, well-known distributed pool,

* and present a unified API to access all of them (e.g. a Structured Query Language of some sort)

* that allows tables of disjoint types to be joined in queries, with appropriate warnings when it creates non-optimized query plans?

In other words, why can't I say that my reports table should use the
"relation" backend, while my messages table should use the "document" backend,
and be done with it?

It's as if, when you went to a car dealership, they asked you whether you
wanted to see the "cars with cigarette lighters" or "cars with automatic
windows" section. Why can't my car do both?

~~~
vog
_have individual tables, indices and views that are either relational or
document-oriented, or graph- or object-based while we're at it, on a case-by-
case basis_

This is already the case. Nowadays, almost all relational databases (except,
of course, MySQL) support XML columns. PostgreSQL supports them
rudimentarily, and DB2 and MSSQL even have special storage strategies and
index structures for XML, i.e. for generic tree structures, data-oriented as
well as document-oriented ones.

Also, abstract data types ("encapsulation", the basis of OO) are implemented
in these databases, too (except, of course, in MySQL), as well as other OO
features such as table inheritance and some kinds of polymorphism.

~~~
derefr
I'd love to see a comparison between using these XML engines for queries and
NoSQL, then. I'm betting they'd be competitive at least to the point that, if
you already had one of the supporting DBMSes set up, there would be little
point in training your DBA on NoSQL as well.

------
duncanwilcox
NoSQL might be hype. Let's get specific: Cassandra eliminates the SQL
database's single point of failure and hard-to-replace masters via a
loose-sync, "eventually consistent" protocol.

Is there some startup offering a web service that doesn't need that?

And have you ever tried to deploy an SQL database capable of syncing across
thousands of miles?

Eventually consistent is quite a different model than ACID. If you accept
that, and accept that you can't rely on networks to always be up, you'll live
comfortably and cost effectively.

~~~
rbanffy
I wonder how long it will take for the simple "if you need ACID, go SQL; if
you don't, you'll be fine with NoSQL" truth to sink in.

NoSQL databases have been in use since before I was born. Is anybody doing
airline reservations on DB2?

------
codexon
I recorded a list of issues that may prevent you from using Cassandra as
general-purpose website storage right now.

<http://www.codexon.com/posts/is-cassandra-ready-yet>

It appears as though Ted's complaint about needing to restart Cassandra to
modify ColumnFamilies (tables) is nearly obsolete. A patch for the last
remaining subtask has been submitted.

------
physcab
I can't wait for these types of articles to die.

------
jgerman
I'm getting tired of both sides of this argument, I'll be happy when the whole
back and forth dies :). Rarely do you see a balanced opinion. Sometimes it's
people that are fanatical about the new-ish NoSQL idea. Other times, like
this, it's someone so stuck in their ways they think that everything but what
they like is a fad and nothing will ever change.

One of the key things I look for when I interview developers is that they can
recognize the right tool for the job. Potentials that get married to a
technology or language are shown the door pretty quickly.

Also, as others have pointed out, this particular article seems to not quite
understand the decisions involved, to the point of getting some things
backwards.

------
wanderr
I myself am not actually sold on the noSQL movement, at least not on the idea
of ditching SQL entirely. It has its place, but may not be the best solution
for every problem.

That said, on the author's complaint about having to restart Cassandra when
doing the equivalent of an ALTER TABLE: lately, every time we do an ALTER
TABLE in MySQL (which takes hours on large tables, during which time you can
do nothing with them), when the alter finally finishes, MySQL mysteriously
crashes. MySQL may have given more thought to the problem, but their solution
obviously has problems too.

------
wrath
Macs are better than PCs, C# is better than Java, Unix is better than
Windows, RDBMSs are better than NoSQL databases...

Why can't people just use the best tool for the job and move on...

From a personal standpoint, we've switched from MySQL to Google AppEngine
(and BigTable). Although I find there are some major drawbacks (e.g. joining
tables), not having to worry about database servers and scalability is a major
advantage. That said, if MySQL becomes the best tool for a particular feature,
then let it be...

------
tlrobinson
The scalability aspect of "NoSQL" is interesting, but I think possibly the
more interesting part is the wide diversity of data models (key-value,
schema-less tables, document databases, etc.).

True, some of these models are more restrictive than traditional RDBMSs to
provide scalability, but I think some of them will often be useful even if
scalability isn't initially a concern.

In fact the term "NoSQL" itself is more relevant to the data models than the
scalability.

------
shin_lao
The author makes some very valid points, but I will retort that RDBMSs are
overused as well.

Sometimes you just want to store data on the disk in a safe and language
agnostic way.

You don't care about relations.

In that case, many "NoSQL" engines are really great.

~~~
aheilbut
"Sometimes you just want to store data on the disk in a safe and language
agnostic way."

You mean, kind of like a file?

~~~
paulgb
Well, since you need a _safe_ way, you'll need a locking mechanism, too. And
since it needs to be language agnostic, you can't just dump the internal
representation of your object to disk, so you need some sort of serialization.

At that point, it's probably easier to go with an already existent object or
document store.
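
A minimal sketch of just that serialization-plus-safety half (helper names
invented here): JSON gives the language-agnostic part, and an atomic rename
ensures readers never see a half-written file. Note it still does nothing
about concurrent writers, which is exactly where an existing object or
document store starts to earn its keep.

```python
import json
import os
import tempfile

def save(path, obj):
    """Write obj to path as JSON via a temp file + atomic rename, so a
    crash mid-write never leaves a truncated file behind."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)  # temp file on the same filesystem
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f)
        os.replace(tmp, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp)  # clean up the partial temp file
        raise

def load(path):
    with open(path) as f:
        return json.load(f)
```

Even this small amount of care (same-directory temp file, `os.replace`, the
cleanup path) is the can of worms the comment below goes on to describe.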

~~~
mpk
Ladies and gentlemen, we have a winner!

If you start using the filesystem as a datastore that requires concurrent
access you open up a whole new can of worms. You need a locking mechanism -
which you'll probably implement using (wrapped) native syscalls. Not only does
that break cross-platform operation, you'll also have to work on and fix (but
find first, of course) bugs in the locking implementation. As you spend more
and more time on this and your app starts growing, you'll find yourself
spending more and more time working with the limitations of the filesystem
you're using (file size limits, directory size limits, access times for files
in large directories). You can hack your way around all that but then you have
to face other critical tasks. Say, backup and restore procedures. Can you do
partial backup/restore operations? No? Well, get ready to write code for that
too. And you preferably want to be able to do those live. Remember those
locking issues you solved when you started down this road to hell? Yeah,
they're back with a vengeance now.

How about a full restore? Maybe you should have implemented a replay-able log
system to get that full restore up to speed with the state of the db since the
time of the last backup.

Or maybe this isn't exactly the right point at which to re-invent the wheel :)

------
dacort
+1 for the Batman rollerblader. But that's about it. Use what works; some
apps do justify NoSQL off the bat. Many don't.

~~~
aaronbrethorst
He's a fixture of Seattle's big Solstice parade:
<http://www.flickr.com/photos/daffodilious/3646480480/>

Keep digging through that photoset and you'll probably see pix of the naked
bikers, too.

------
VBprogrammer
I found the comparison with 'Real Businesses' particularly funny, given that
Wal-Mart has 2.1 million employees worldwide and Twitter has 75 million
users... their scaling requirements are different by a factor of 35.

~~~
code_duck
You're comparing two completely different businesses on disparate metrics.

How many 'users' does WalMart have worldwide? I'd say at least 500 million on
whom they keep a purchase record. Then there's products, credit card numbers,
suppliers, etc.

~~~
VBprogrammer
But in terms of their 'real business' it is not the number of customers they
have that matters, but the number of employees, since only their employees
will be using their systems. The purchase records, products and credit card
numbers are closer to Tweets than to users.

~~~
code_duck
I don't think the number of people doing data entry or accessing a system
matters nearly as much as how much data has to be tracked. WalMart's data
needs exceed those of Twitter; they exceed the needs of Facebook. I don't
know if they meet or exceed Google's, but I'd imagine it's up there.

------
abalashov
Thank goodness - someone finally said it.

