
NoSQL is a Premature Optimization - timf
http://smoothspan.wordpress.com/2011/07/22/nosql-is-a-premature-optimization/
======
birken
This is a great post, however I think there is one caveat... NoSQL is a
premature optimization for _relational data_. NoSQL isn't necessarily an
optimization if your dataset is one that isn't relational, where you gain
flexibility and get the scaling for free.

However, the truth is that most business datasets are relational, and because
of that relational databases do an excellent job of storing and accessing that
data in an efficient way. I completely agree that people who try to force what
would be a simple relational database into a complicated NoSQL system to gain
scaling in the future are making a huge mistake.

~~~
espeed
_NoSQL is a premature optimization for relational data_

I don't know about that either. Graph DBs have been around a while and are
ideal for relational data (you don't have to mess with tables and joins
because everything is explicitly joined), but Web developers are just starting
to realize their power.

If I am developing a website that uses the Facebook Graph API
(<http://developers.facebook.com/docs/reference/api/>) or the Twitter API to
store my users' friends and followers (and many modern auth systems are doing
this), a graph database is ideal because you can use a graph query language
like Gremlin (<https://github.com/tinkerpop/gremlin/wiki>) to do all sorts of
cool stuff, like find a user's friends of friends:

    
    
      g.v(user_id).outE('friend').inV.outE('friend').inV
        

This says for the graph database "g":

1\. Start at the vertex for the user_id

2\. Follow the outgoing edges labeled "friend" to the incoming vertices (these
incoming vertices are the user's friends)

3\. From the user's friends, follow the outgoing edges to the incoming
vertices one more time to get their friends.

This is so much simpler than messing with joins in a relational database, and
it's much more performant on large data sets.

~~~
thadeus_venture
Funny thing though, a well designed relational database will execute your
graph queries faster. It won't do generic graph queries faster, but if your
sql data model and code are optimized for the type of graph data you are
storing, sql will almost always be significantly faster, orders of magnitude
in some cases. Obviously not a data model argument, but a real world argument
nevertheless. If you then need to scale further you will probably end up
putting up the same caching layer between your database and the front end, but
at least all your data will be in once place (the sql database).

~~~
espeed
_a well designed relational database will execute your graph queries faster_

Why do you say that? -- see "MySQL vs. Neo4j on a Large-Scale Graph Traversal"
([http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-
lar...](http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-large-scale-
graph-traversal/)).

You can represent a graph in almost any data structure, including a relational
database. But the difference between a graph database and everything else is
that in a real graph database (like Neo4j), each node has an internal/local
index for its adjacent nodes so it doesn't have do an external look up for
each traversal step.

Watch this video on "The Graph Traversal Programming Pattern" to see what I'm
talking about (<http://vimeo.com/13213184>).

~~~
thadeus_venture
From the article:

"However, no attempts have been made to optimize the Java VM, the SQL queries,
etc"

Emphasis being on optimizing the sql. We have run tests comparing neo4j and
postgres, and postgres comes out with greater throughput for our data set,
where our database implementation was done people who know postgres extremely
well. Where you will see especially great differences is aggregate queries,
such as if you want to count the number of a certain type of connections
coming into a set of nodes, and then sort these nodes by that number. A sql
database is much better at stuff like that.

~~~
espeed
What were you using to query the graph?

Gremlin has significantly improved what you can do with graph aggregating and
sorting:

    
    
      // count incoming friends for each node and sort by most friends
      m = [:] 
      g.V.inE('friend').outV.groupCount(m)   
      m.sort{}

------
antoncohen
Full Disclosure: I work for Clustrix, though I don't speak for them in any
official or unofficial way. These are my personal opinions.

I think the author of the blog completely missed the point of what Michael
Stonebreaker (VoltDB), Jim Starkey (NimbusDB), and Paul Mikesell (Clustrix)
said. All three companies make scalable, relational, ACID compliant, SQL
databases. None of them are advocating using NoSQL.

Not to put words in their mouths, but they are saying you need what MySQL
gives you. You should not have to move the relational and consistency logic to
your application, that's the job of the database. The problem with MySQL is
not that it is SQL, it is that MySQL doesn't scale. In the case of Clustrix
their database is MySQL compatible, it uses the same library interfaces (e.g.,
MySQLdb in Python), the same command-line tools, and it can participate in a
MySQL replication setup as master or slave. The difference is it actually
scales, so you don't have to shard.

With MySQL, once you reach the performance limit of using massive hardware,
lots of read slaves, and memcache, the only thing left is sharding. With
sharding you are breaking up your database into lots of databases running on
different machines. With separate databases you lose a lot of the relational
ability, or you move it into the application. What the "NewSQL" databases are
offering is the ability to keep a single database, and when you want to scale
performance or capacity you just add another node to the DB cluster.

To those that say NoSQL is easier and more flexible than SQL, that entirely
depends on your data. You should pick the database that fits your data, which
might be multiple databases for different datasets within your organization.
Facebook uses a lot of NoSQL, but they also use a lot of MySQL. They have
obviously decided that some of their data works best in NoSQL, and some of it
works best in a RDBMS no matter how hard it is to scale.

So build your app using whatever works best for you. If that's CouchDB and
Redis, cool. If it's MySQL, go for it. Once you've proven that there is demand
for your app, and it takes off, then deal with scaling.

~~~
rimantas
I think MySQL NDB is worth mentioninig there.

------
jchrisa
Wrote this comment on the OP (waiting for moderation):

NoSQL is not just about scale (although it can be). In my opinion the main
benefit of giving up relational constraints is flexibility. Find me a
relational system that can credibly do offline replication to mobile devices.
Each of the options I’ve seen are full of gotchas. But if you replace the
relational model with an MVCC document model, you can suddenly solve an
entirely new class of problems.

Read about how Apache CouchDB is being used in rural Africa to bring
collaborative data and web technologies to health clinics that don’t have
reliable internet access: [http://radar.oreilly.com/2011/03/couchdb-zambia-
healthcare.h...](http://radar.oreilly.com/2011/03/couchdb-zambia-
healthcare.html)

These same patterns are applicable to mobile connections in the 1st world.

~~~
Joakal
I'm unsure what you're asking for, but isn't SQLlite suitable as it's used by
many applications even on mobile?

~~~
andrewtj
He's not asking for anything. He's pointing out how a different data model
allows for solving a different set of problems. The MVCC page on Wikipedia
might give you a better background so that you're able to contrast the two
models and determine for yourself where they are useful:
[http://en.wikipedia.org/wiki/Multiversion_concurrency_contro...](http://en.wikipedia.org/wiki/Multiversion_concurrency_control)

------
ryoshu
This seems like a complaint looking for a problem. NoSQL is a very broad term
-- are you talking about Cassandra, MongoDB, CouchDB, Redis, Memcached, Tokyo
Cabinet, neo4j, etc.?

Scalability is one piece of the puzzle; picking the best tool for the job is a
much bigger piece. Work with what you are comfortable with, but realize that
an RDBMS may not be the best way to deal with a graph. Logging is probably not
best done through an RDBMS. We have tools that we can use to solve problems in
a more efficient way, so we should use them.

RDBMS are not a panacea and neither are NoSQL database solutions. Use the
right tool for the job.

------
ericmoritz
One could argue that SQL is a premature optimization as well if all you need
is a key/value store or a document store. Using SQL just in case you may need
ad-hoc queries could be as harmful as using NoSQL just because you may need to
scale.

Each application is different and you need to pick the appropriate tools.

~~~
badclient
And that would be a perfectly fine argument! Though I am not aware of anything
significant that can be built without a relational db. Even the simplest
college assignments about building an address book involve querying and
referencing primary keys across tables.

~~~
icebraining
You can do queries and reference other documents many NoSQL databases; look at
the documentation for GAE's datastore:
[http://code.google.com/appengine/docs/java/datastore/queries...](http://code.google.com/appengine/docs/java/datastore/queries.html#Fetching_Results)

Or for MongoDB: <http://www.mongodb.org/display/DOCS/Advanced+Queries>

~~~
dreamdu5t
That is relational though. In those cases, Redis and MongoDB must create some
kind of key reference table like MySQL would.

~~~
icebraining
It depends on what do you mean by relational. I mean, even the simplest key-
value store has to have keys, so if you store the key of another
record/document, you have a "relation."

But you can't set _constraints_ that enforce such relations, which is the
whole point of the relational model.

------
mobileman
I call bullshit.

Basically, Nosql is not just about scale but flexibility. Give me a schemaless
engine anyday

------
donpark
Wrong from point 1. I use MongoDB and Redis because first is easier to
prototype with ad hoc and second offers unique features one has to remap
manually if SQL is used. In short, NoSQL requires less investment than SQL for
my use cases.

~~~
imack
Definitely agree. Getting Rails up and running on MongoDB is a lot faster than
with ActiveRecord on top of some SQL. Especially when your data format is
changing quickly, it's nice to just make a few changes in the model rather
than having to track migrations.

~~~
jasonkester
You realize that this has nothing to do with sql/nosql and everything to do
with your chosen framework, right? If, you used, say, a framework that
autogenerated itself from your schema at build time, then changing the
database would be as simple as changing the database, which is trivial.

I think a lot of our perceptions about technologies are influenced by the
things that sit between us and those technologies. A lot of the talk about
this particular issue has nothing to do with Relational databases or key/value
stores.

~~~
rhizome31
Either I didn't get your point or you don't know what what Rails migrations
are. Migrations are just a way to "change the database" in a repeatable way.

~~~
jasonkester
Right, but the grandparent's complaint was that you had to change your model,
then write migrations to match.

In other frameworks, you use your tool of choice to change the database
schema, then build the project and you're done. All the "model" stuff
regenerates automatically. Thus, schema/model changes have no pain associated
with them whatsoever, so there's no particular advantage to using nosql vs
sql.

~~~
rhizome31
With Rails' ActiveRecord the model introspects the database schema at run time
and figures out field changes automatically. What it doesn't figure out is
relations and you still have to specify them in both the database schema and
the model. Since you're mentioning project build, I suppose you're referring
to Java or C# frameworks. Are those frameworks able to generate code for
relations as well? (I suppose this requires following naming conventions or
setting referential constraints)

~~~
jasonkester
Yeah, conventions cover pretty much every case you tend to come across. Got a
column called CompanyID that has a foreign key constraint over to the Company
table? Perhaps this is a relation... Nice of Rails to make it acceptable to
simply ask you to name your stuff in a consistent way so that things just
work.

The added bonus of a build step is that you don't need to keep regenerating
the same classes and SQL code every page load, but it's not really any
different from the programmer's standpoint.

------
joe_the_user
Obviously, it also depends on exactly what kind of application you are
building.

If you primarily want to persist objects and indices that don't have good or
easy analogue in the SQL world, a NoSQL solution is good solution because you
don't have to bother with the unneeded SQL overhead.

If you've doing something that maps reasonably well into the relational model,
well, that's what you should use. Rational technology has been around a long
time and has been scaled to huge levels. And it seems fairly clear a social
network maps pretty well to a relational model.

------
gdulli
I've been trying to tell my colleagues and higher-ups this for the last few
weeks as I've planned out a new platform that "needs to scale." They've heard
that some of our business partners are using MongoDB and want to know why I'm
using Postgres.

The truth is that our traffic could wildly exceed our most hopeful projections
and we still wouldn't nearly be in the position of orgs like Twitter,
Facebook, or Netflix, stretching the limits of what an RDBMS can do without
becoming a nightmare to maintain.

~~~
badclient
Sounds like my experience applying to companies who list every NoSQL option
under the roof in the job description and yet looking at what they do, I can't
figure out why they'd need all that.

------
Androsynth
In theory you should use the right tool for the job and there shouldn't even
be a discussion.

However in reality not all DB's are created equally. It comes down to this: do
you want to spend a known amount of time hammering in a screw, or do you want
to spend an unknown amount of time optimizing your state of the art, but
unproven screwdriver.

This is important for people starting a business. You don't start a business
to test out new technology, that's what weekend projects are for. They need to
be focusing on the product and not spending time 'just getting it to work'.

Mysql/MC aren't sexy, but it takes a trivial amount of time to set up and tune
them.

There are lots of caveats to this; the entire argument can be nit-picked to
death. But I think it holds for the generic webapp template that just needs
data to persist and then be served back to users.

------
esmooov
A lot of the confusion and furor over NoSQL vs relational vs NewSQL is caused
by the idea that these technologies demand any sort of mutual exclusion. At my
shop we use Redis and MySQL and SQLite extensively and - to the best of my
knowledge - they have never gotten into scuffles with each other, holy wars of
relationality. In fact, this seems to be a proclivity unique to their users.

That being said, I think it is - wait for it - premature to call NoSQL an
unwarranted early "optimization." For our real-time analytics app, we use
Redis because we are just storing key (page name) value (hits). With millions
of writes a day for relatively simple data, it is nice - and in no way
premature - to run Redis on a medium-sized instance. Data retrieval is fast
and after a year and a half we haven't had any problems.

Other apps, use MySQL exclusively. For our many-many forests of tag, entry,
source, author, etc etc relational data, we rely on the easy oversight of
MySQL and the rich ORMs that have gracefully evolved to cut through the
jungle.

What's more, there are many applications that use both Redis and MySQL. My
Silver gem <https://github.com/tpm/silver> wraps MySQL requests in a Redis
caching layer to speed up queries. This is not terribly novel or that
different than, say, memcached but it does what we need it to.

The point is that just like apps are as multiform as the imagination is
competent, so should be the tools and strategies we use to create them and
solve their problems. Yes, people will use NoSQL incorrectly. People use MySQL
incorrectly and inefficiently more than not. Hell, you can use YAML in a bad
way. Those aren't the cases we should focus on. We should focus on all the
systems that do work.

I'd like to see, for every mind-thinky post about whether or not MySQL is a
death trap or NoSQL is a nu-wave panacea, a post about how some company is
making the technologies that they have chosen work for them. At the end of the
day, there is no uniform march toward optimal global technology efficiency.
Just the small victories of a startup that needs to record cat GPS movements,
a newspaper that needs to analyze a million pages of leaks, a broker who needs
to pour through SEC dumps. NoSQL, NewSQL, SQL, HithertoUnknownUnPostSQL: I say
let em all in. Someone will find them useful.

~~~
dreamdu5t
What about lock-in though? What if you decide you _do_ want some relationship
or way of filtering the keys you're storing with Redis. After you already have
it in Redis as keys/values.

What then?

~~~
pjscott
Hopefully you abstracted as much of the database code as possible, so you can
switch to (for example) MySQL without serious pain. This only gets tricky if
you're using some of the more fancy features of Redis, like sorted sets, which
are a bit more clunky to do in SQL.

------
badclient
Can someone explain to me what NoSQL equivalents are for concepts like primary
key and general data consistency?

My other question is that I've read several places that NoSQL makes most sense
for full-text documents. And yet I see company after company with little
apparent full-text data boasting about their NoSQL use. I am especially aware
of this as someone in the job market seeing terms like Hadoop and MongoDB just
thrown around in the job description. How? Why?

~~~
icebraining
There isn't a single model, but an eventually consistency is the most common:
[http://www.allthingsdistributed.com/2008/12/eventually_consi...](http://www.allthingsdistributed.com/2008/12/eventually_consistent.html)

------
6ren
Isn't a key-value store (NoSQL) simpler to use than a relational database? (I
haven't used one so I don't know for sure).

Hierarchical and network stores were popular before RDB was invented; although
RDB was slower, the big advantage was that you could transform the data into
whatever form you need (instead of having to use the specific hierarchy used
to store it). At least, that was the key benefit according to my reading of
Codd's 1970 paper.

~~~
_delirium
Yeah, I can see this argument going either way. If you're avoiding SQL for
NoSQL for scalability reasons, that might be premature optimization,
especially if your data is pretty easily relational (and e.g. you have
something approximating a schema).

But if you just need to persist a data model, don't care too much about the
schema, and come up with the solution of, "I'll just serialize my hash table
to a CSV flat file", that's just taking the ultra-simple solution, not really
an optimization. If anything, pulling out the big iron of SQL when all you
need to do is serialize a hashtable would be the premature optimization (or at
least, premature architecting).

------
viscanti
Awful post. His example for why NOSQL is "difficult" to start with, is a
company with a huge RDBMS trying to transition. Had they started with NOSQL,
it would have been easy to architect around that from the beginning, rather
than cramming it in later. He then goes on to argue that it's much easier to
transition to NOSQL later, which is the exact opposite of his example.

NOSQL isn't one catchall group. There are different tools there for different
jobs. His arguments rest on the assumption that NOSQL is hard(er). If it's not
hard, and it solves a problem, then it's not premature optimization, it's just
optimization.

There are certainly situations where using a NOSQL database to start with
gives you a competitive advantage and makes things easier down the road. This
same argument is made every time a new technology comes onto the scene. People
who made a living using something else will argue that learning the new thing
is too hard, and isn't worth the time. If he doesn't want to bother with
learning about new tools, that's fine, but it's silly for him to argue that
learning them is pointless or a waste of time.

------
fedd
NoSQL is not only about scale, but also about flexibility?

the flexibility and schemalessness of NoSQL is actually a consequence of the
scaling possibilities - it's hard to manage horizontal partitioning and
maintain consistency between multiple tables with foreign keys and rollback
transactions in a distributed environment; and easy when you have only one
table with two fields: key and value.

when, like in mongodb, the value is actually a tree-like document, and you
need to have an index on some of the element, you gradually return to schema,
because this element should be in the same place/surrounding in all indexed
values.

i am saying that flexibility is not why you should choose NoSQL, but actually
the scalability that you might need later.

(and non-relational data is kept ok in blob fields or in files and indexed by
lucene - if you want (i don't))

------
campnic
I found this article fairly uninformative.

His first point is 'all NoSQL is less mature'. Second point is 'there is no
advantage to NoSQL except scale' Third point is 'you'll have time to scale
later, which includes moving to NoSQL'

None of these arguments is novel. Developers pick NoSQL for other reasons as
well.

I didn't find a pioneer tax to setting it up. It worked extremely well for
non-uniform data. I found its design easier to handle for failover.

This entire debate is old. NoSQL is a tool. His arguments add nothing to the
debate.

------
JVerstry
Fantastic post. Tells the truth.

