
No to SQL? Anti-database movement gains steam  - johns
http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9135086
======
mdasen
_Cassandra can write 50GB of data in 0.12 milliseconds, more than 2,500 times
faster than MySQL._

Really? Cassandra can write over 416TB per second? If Cassandra can write
50GB in 0.12ms, then it can write more than 416TB in a second and 25 petabytes
per minute. Which clearly isn't true. Of course, that quote would also mean
that MySQL can save 166GB per second.
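The arithmetic is easy to check - a quick sketch, using only the figures from the quote:

```python
# Figures quoted in the article: 50 GB written in 0.12 ms.
gb = 50
ms = 0.12

gb_per_second = gb / (ms / 1000)            # ~416,667 GB/s
tb_per_second = gb_per_second / 1000        # ~416.7 TB/s
pb_per_minute = tb_per_second * 60 / 1000   # ~25 PB/min

# "2,500 times faster than MySQL" would then imply MySQL writes:
mysql_gb_per_second = gb_per_second / 2500  # ~166.7 GB/s

print(round(tb_per_second, 1), round(pb_per_minute), round(mysql_gb_per_second, 1))
```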

The article is a piece about technology that the writer doesn't understand.
They all have trade-offs. Key-value stores eliminate your ability to access
data by anything other than its key. So, you have an article with an id of 5.
What if you want to look up the articles by author 9? You simply can't.
Column based databases aren't magic either. They just change the orientation
of the data. Rather than seeing 1,Adam,West;2,Mark,Twain;3,Will,Smith you see
1,2,3;Adam,Mark,Will;West,Twain,Smith. That offers some advantages - such as
being able to do metrics on a column easily - but it also means that if you
want to get the data for one row, it has to do MORE work than a regular row-
based database. To retrieve 1,Adam,West it has to do multiple lookups since
the data isn't located next to each other. And frankly, random access of a row
is what you're likely to want.
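To make the orientation concrete, here's a toy sketch in Python (just the names from the example above, nothing resembling a real storage engine):

```python
# The same three records in the two orientations described above.
rows = [(1, "Adam", "West"), (2, "Mark", "Twain"), (3, "Will", "Smith")]

columns = {
    "id":    [1, 2, 3],
    "first": ["Adam", "Mark", "Will"],
    "last":  ["West", "Twain", "Smith"],
}

# Row store: one lookup fetches the whole record.
def row_get(i):
    return rows[i]

# Column store: one lookup *per column* to reassemble the same record.
def col_get(i):
    return (columns["id"][i], columns["first"][i], columns["last"][i])

# But a metric over one column touches a single contiguous array
# instead of walking every row:
def avg_id():
    return sum(columns["id"]) / len(columns["id"])

print(row_get(0), col_get(0), avg_id())
```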

 _They run on clusters of cheap PC servers. PC clusters can be easily and
cheaply expanded without the complexity and cost of "sharding," which involves
cutting up databases into multiple tables to run on large clusters or grids._

That's like saying that if you plant a hamburger you'll get a McDonalds. They
don't just magically run on clusters of cheap PCs. The fact is that there are
tradeoffs. If you put 50% of the data on server1 and 50% on server2 you'll get
faster query speeds except that you then have to know where to look for a
specific piece of data.

So, sharding: basically, you split your data along logical (or non-logical)
lines into separate databases. Like registration at a conference, users A-M
are in DB1 and users N-Z are in DB2. So, when someone comes up with a query,
you can easily tell which place to send them and then the query is run there.
Or you can non-logically split and have a map that tells you where to go. So
you query metadb and ask "where's Frank" and it says "Frank is in DB1" and
then you query DB1 for Frank.
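Both schemes are trivial to sketch; something like this (DB names and contents are made up):

```python
# Range-based ("logical") sharding: users A-M -> DB1, N-Z -> DB2.
def range_shard(name):
    return "DB1" if name[0].upper() <= "M" else "DB2"

# Lookup-map ("non-logical") sharding: a meta-DB records where each key lives.
metadb = {"Frank": "DB1", "Zoe": "DB2"}

def map_shard(name):
    return metadb[name]  # "where's Frank?" -> "DB1"

print(range_shard("Frank"), map_shard("Frank"), range_shard("Nancy"))
```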

BigTable isn't so different. It does do some awesome auto-splitting of tables,
but when it comes down to it, it's much the same. There's a single META0
tablet that gets queried to find the location of the META1 tablet which knows
the location of the actual data. And, in order not to overload the META0
tablet, clients need to cache the locations they get back.
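A rough sketch of that lookup-plus-cache pattern (server names are made up; this is the shape of the idea, not BigTable's actual API):

```python
# Three-level lookup in the BigTable style: META0 -> META1 -> tablet server.
meta0 = {"meta1": "server-A"}          # location of the META1 tablet
meta1 = {"users:row5": "server-C"}     # locations of the actual data tablets
cache = {}                             # client-side location cache
lookups = {"count": 0}                 # how many times we bothered META0

def find_tablet(key):
    if key in cache:                   # cached: META0 is never touched
        return cache[key]
    lookups["count"] += 1              # uncached: walk the meta tablets
    _ = meta0["meta1"]
    cache[key] = meta1[key]
    return cache[key]

find_tablet("users:row5")
find_tablet("users:row5")              # second call is served from the cache
print(lookups["count"])                # only one trip to META0
```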

\--

Oftentimes, the problem is that you have web programmers who wouldn't know a
B-Tree from a linked list. All of these technologies exist because there are
certain things that they're good at. However, what this article purports to
know is that these NoSQL databases are just superior. In most ways, they're
inferior. A key-value store has its place, but it's so limiting that you
usually can't _just_ use a key-value store; you need more querying power.
Likewise, whether a column-oriented database is right for you depends more on
how you want to access the data (by column or by row).

Oh, and really, learn SQL indexing and check whether your queries are doing
full table scans. That's the root of a lot of problems. I mean, a good index
should do lookups in log(n) time which means that on 10,000 rows a query
should take 1/769th the time; on a million rows a query should take 1/50,000th
the time; on a billion rows it will take 1/33,333,333th the time of doing a
full table scan of a table that size. Yeah, indexes make a huge difference,
but they aren't magic either. They merely order the data in a certain way that
makes it easy to pluck out certain rows since you don't have to look at every
value. For example, dictionaries are alphabetized and so you don't have to
look at every word. You just start jumping toward the area of the word you're
looking for. Indices work the same way. Now imagine if the dictionary was
unordered. You'd have to look at every single word to see if it was the word
you were looking for.
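The dictionary analogy maps directly onto binary search; a small sketch:

```python
import bisect

words = sorted(["apple", "banana", "cherry", "date", "elderberry"])

# Indexed lookup: jump toward the word, like flipping through a dictionary.
def indexed_find(w):
    i = bisect.bisect_left(words, w)     # O(log n) comparisons
    return i < len(words) and words[i] == w

# Full "table scan": look at every single word.
def scan_find(w):
    return any(x == w for x in words)    # O(n) comparisons

print(indexed_find("cherry"), scan_find("cherry"), indexed_find("fig"))
```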

Oh, and just saying "put an index on stuff" isn't enough. It isn't too
complicated, but there's a Google talk that goes over some of the mistakes one
can make by assuming that indexes are doing what you want
([http://www.youtube.com/watch?v=u70mkgDnDdU&feature=chann...](http://www.youtube.com/watch?v=u70mkgDnDdU&feature=channel_page)).
It's in the middle somewhere, but it's a good watch for the whole thing.

\--

Really, is this the type of article we've become? I mean, column-oriented
datastores and key-value datastores are both valuable technologies, but this
is a NoSQL flame written by someone who doesn't know that Facebook is still
MySQL-backed and uses Cassandra for only specific things, and that they use
many different technologies where they're good (like heavy use of memcached as
a key-value store to reduce load on MySQL where queries aren't needed). Nor
does the article acknowledge the drawbacks that the authors of these systems
would talk about themselves. For example, MongoDB states that it lacks ACID
integrity and is more suited for "high volume, low value data" - that should
be read as things like whether someone has upvoted or not; if that data gets
lost it doesn't matter much, because the person can click the vote button
again and all is well.

Why are we supporting "we hate this technology just because"? Technologies are
tools that have their places. It's foolish to write off a good technology just
because you want to make fun of it or deride it. Use the right tool for the
job. In your web application, you might find that using multiple tools in
concert is the way to go.

~~~
drawkbox
"They all have trade-offs. Key-value stores eliminate your ability to access
data by anything other than its key. So, you have an article with an id of 5.
What if you want to look up the articles by author 9? "

For cases like that you don't have one source of data; you do something like
an AFTER INSERT trigger that updates other flattened tables with the data you
need.

For instance, where you would normally perform a JOIN, you instead update a
pre-joined table (or a view built by incremental updates) right after each
save.

Then you can filter by many other things, and you really aren't limited much
by the space it takes to make those duplications.
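A minimal sketch of that pattern, using SQLite for illustration (table and column names are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE authors  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE articles (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);

-- Flattened "join table" kept current by a trigger, so reads never JOIN.
CREATE TABLE articles_flat (article_id INTEGER, author_name TEXT, title TEXT);

CREATE TRIGGER articles_ai AFTER INSERT ON articles
BEGIN
    INSERT INTO articles_flat VALUES (
        NEW.id,
        (SELECT name FROM authors WHERE id = NEW.author_id),
        NEW.title
    );
END;
""")

db.execute("INSERT INTO authors VALUES (9, 'Mark Twain')")
db.execute("INSERT INTO articles VALUES (5, 9, 'On Rivers')")

# Filter by author at read time without any runtime JOIN:
print(db.execute(
    "SELECT title FROM articles_flat WHERE author_name = 'Mark Twain'"
).fetchall())
```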

It happens with large RDBMS systems anyway after about 5GB of data; the same
type of scalability decisions have to be made.

But yes the article is off and you are right, most likely systems will
approach this with multiple solutions for specific problems.

~~~
gaius
_It happens with large RDBMS systems anyway after about 5GB of data; the same
type of scalability decisions have to be made._

More like 5TB these days.

~~~
drawkbox
I should have clarified: I meant around 5GB per one type of data - maybe a
table or dataset that needs to be merged. For instance, if you had a product
database table that needed to be joined with another gigabyte-sized table or
set of data at runtime.

At that point, even in an RDBMS, you have to stop with the joins and flatten.
I have experienced these limits in Oracle and MSSQL at around 20-30 million
row tables.

Scalability fundamentals like database flattening, dimensional modeling, etc.
are what key-value stores give you from the start. But really, a good mix
works best and it's project specific; I'm just saying that in a future with TB
and PB of data, the JOIN is a historic remnant.

It is possible one day that RDBMS will be seen as one of those evil
optimizations that we made in our small relative worlds at the time.

~~~
gaius
I've got tables much, much larger than that.

 _It is possible one day that RDBMS will be seen as one of those evil
optimizations that we made in our small relative worlds at the time._

I doubt that for the simple reason that relational databases are maths: the
relational algebra and relational calculus. There's no such theoretical
underpinning to object databases.

~~~
drawkbox
_I doubt that for the simple reason that relational databases are maths: the
relational algebra and relational calculus. There's no such theoretical
underpinning to object databases._

True, but the relational architecture just moves to the object or code level,
and if you think about it in relation to size, that's where it makes more
sense for the future.

The RDBMS was the brain, but it is also the storage. Going forward, code will
dictate how to use the storage rather than the storage itself; the storage is
becoming a component.

------
geebee
This movement sounds reactionary, and perhaps for good reason. SQL has been
the default method for storing and retrieving data for a long time, but
obviously it isn't the best fit for every case.

That said, I like sql. I find it extremely easy and natural to keep relational
models in my head and write queries. While I do appreciate some of the tools
that free me from writing very basic CRUD stuff or persisting objects, I don't
like to be too insulated from my database, and I'll gladly drop to sql when I
need to.

The problem is that sql has been used for _all_ data persistence and retrieval
for far too long, when all kinds of different models might have been
appropriate. So I guess this is a kind of backlash. And eventually, there will
be a backlash against the backlash - or, more likely, people will forget there
ever was an argument, and we'll get a new generation of developers who think
they've discovered relational algebra for the first time.

~~~
silentbicycle
Another thing is that some people have problems with SQL itself, not
relational databases per se. SQL is just the compromise-of-a-standard that
came out of several RDBMS query languages competing for mindshare.

And, naturally, in some people's minds any problems that MySQL has are
attached to SQL, and in fact to all relational databases.

------
fauigerzigerk
There seem to be two quite contradictory arguments for NoSQL. One camp says,
we already have what we need in the form of [programming language] objects, so
why would we take everything apart and map it to something as reductionist as
relational tables?

The other camp says, relational tables are soooo complicated, we need
something even more reductionist and free form, so key-value pairs are just
right.

I can understand the second argument, but the first is a step backwards unless
your programming language is something like lisp, which has a reductionist
data model itself.

Object oriented models are fine if you write one application. But they tie
functionality to data very closely and that makes repurposing of data for
separate applications much more difficult. So how do they analyse their data
or use it in more than one application?

They have to write a lot of code using an application-specific object-oriented
API instead of a general-purpose reductionist data model. That's horrible, but
just how horrible only becomes clear a few years down the road. It's a
disaster in the making. A whole army of programmers will be required to
extract data from all those big APIs of legacy apps.

I'm convinced that data has a different life cycle than procedural code and
therefore needs to be expressed in a simple, uniform, reductionist model
independent of all application code.

~~~
ubernostrum
"Object oriented models are fine if you write one application. But they tie
functionality to data very closely and that makes repurposing of data for
separate applications much more difficult. So how do they analyse their data
or use it in more than one application?"

This was an extremely poorly-written and poorly-researched article. One
problem is that the projects being discussed are not object databases in the
sense of storing straight-up serialized representations of business objects.

Take CouchDB for example; the basic "thing" you store in it and get back from
it is a JSON object, which is a set of key/value pairs. In this sense it's not
too far removed from SQL-based DBs, because you still choose which bits of
data you're going to store (the keys in the object). It departs from the SQL
point of view in not requiring that all the records have the same schema, and
in not having a representation (at the data-storage level) of relations
between records.

This turns out to offer some big advantages: in my experience, the number-one
cause of unwieldy SQL is a schema that's had to grow over time to accommodate
ever more edge cases. Maybe it's been done through lots of nullable columns,
maybe it's been done through lots of related tables or some other mechanism,
but it frequently has to be done and ends up making the database painful to
work with.

CouchDB throws that out the window: edge-case records simply go in like
anything else, and if they don't have some particular field present in other
records, so what? The "query" is actually hitting a map/reduce in which the
"map" function can take whatever action it wants with a record that's missing
some particular key. It can skip that record, it can spit out a default value
for a missing key, it can do anything it likes.
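A toy sketch of that idea in Python, rather than CouchDB's actual JavaScript view API (document and field names are made up):

```python
# Schema-free records: the third one lacks an "author" field entirely.
docs = [
    {"_id": "a1", "type": "article", "author": 9, "title": "On Rivers"},
    {"_id": "a2", "type": "article", "author": 3, "title": "On Roads"},
    {"_id": "n1", "type": "note", "body": "no author field at all"},
]

# The "map" half of a view: emit (key, value) pairs, deciding per record
# what to do with a missing field -- here, edge-case records are skipped.
def map_by_author(doc):
    if "author" in doc:
        yield (doc["author"], doc["title"])

view = sorted(kv for d in docs for kv in map_by_author(d))
print(view)
```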

"I'm convinced that data has a different life cycle than procedural code and
therefore needs to be expressed in a simple, uniform, reductionist model
independent of all application code."

Simple and independent, yes. Reductionist and uniform, no.

To run with CouchDB as the example, those assumptions are thrown out because
CouchDB essentially adds a layer to the stack. Traditionally, you have
application code over _here_ , querying data over _there_. In CouchDB, you
have application code which queries a CouchDB view which returns data, but the
query doesn't necessarily know anything at all about what the actual data in
the DB is, or how it's structured (or even if it _is_ structured in any
sensible way; maybe it's just a bunch of random key/value pairs). The view
layer is the part which cares about that.

And views are not static (or mostly static) things like the schemas in
relational DBs; views are free to evolve over time, you're free to add or
remove views in response to changing needs, and _the underlying data never has
to change as a result_. And so you don't need to agonize over the most
efficient way to reduce your data to a uniform schema. You don't need to
"migrate" your underlying data storage representation to change the types of
things you can store or the types of queries you can run.

~~~
fauigerzigerk
I think data independence all but implies a reductionist data model (uniform
probably isn't a very clear term). And when I say data model I don't mean a
particular schema but the primitives that are used to model the data, like
relations and attributes in the relational model or key value pairs in
Berkeley DB.

I cannot say anything useful about CouchDB as I don't know it nearly well
enough. What I think doesn't work is to hide data behind a procedural API when
it comes to read access (write access is a different matter).

A procedural API is a black box that you cannot reason about and has a very
application specific purpose that doesn't lend itself to analytics apps.
Analytics apps should know as little as possible about particular
applications. They cannot easily call arbitrary functions.

My experience with data centric apps is that it's a good thing to have that
situation where everything is a table and each of the few operations you have
creates another table. Tables in, tables out. The same thing works with lists,
key/value pairs, etc.
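That "tables in, tables out" style is easy to sketch (a toy example with tables as lists of dicts, not a real query engine):

```python
# Each operation takes a table and returns another table,
# so operations compose freely: tables in, tables out.
def select(table, pred):
    return [r for r in table if pred(r)]

def project(table, cols):
    return [{c: r[c] for c in cols} for r in table]

people = [{"id": 1, "name": "Adam", "city": "NY"},
          {"id": 2, "name": "Mark", "city": "SF"}]

result = project(select(people, lambda r: r["city"] == "SF"), ["name"])
print(result)
```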

My battle cry would be "No application-specific APIs" (for data access).

------
sant0sk1
My biggest hold-up to investing time learning how to harness "NoSQL" data
stores is that there are so many young-ish projects and I'm afraid of picking
a loser: CouchDB, MongoDB, Redis, Tokyo Cabinet, etc., etc.

Seems like it's too early to cast lots, and I'll just stick with what I know
for the time being. Is that prudent of me or just naive?

~~~
mahmud
The NoSQL school includes both key/value databases and bona fide object
stores. None of this stuff is new: BerkeleyDB, circa 1986, is as
industrial-strength as they come. O2, GemStone, Statice, etc. all go back to
the 70s and 80s.

Just because something is on github doesn't mean it's ground breaking ;-) The
future is full of wonderful things that are yet to reach us from Lisp and
Smalltalk research of the past.

~~~
silentbicycle
The sticking point with BDB is typically its license - if used commercially,
you suddenly need a commercial license from Oracle, and the cost is
unspecified.

------
wvenable
Performance is always a trade-off; you rarely get something for nothing. These
object stores are great when all you're doing is storing objects, but that's a
pretty simplistic use case. If you need to do more with your data than simply
read it back in object-sized chunks, you're going to find that a relational
database probably provides better performance. For web sites especially,
retrieving and constructing objects is a fairly expensive operation if you're
just generating list pages.

I wonder if we just have a generation of programmers who can't think in SQL.
They're used to using ORMs (which are useful abstractions) but can't work at a
lower level. Sending a query over the wire to get exactly the results you
want, and no more, and having it optimized and run entirely on the server is
pretty damn efficient.

~~~
gaius
_I wonder if we just have a generation of programmers who can't think in SQL._

I think that's probably true. SQL requires thinking declaratively, not
imperatively, and in sets, not objects. Kids these days don't understand the
difference between a table definition and a class and the difference between a
row and an object. The declarative/functional style isn't really taught
anymore; everyone just wants to learn Java and get a job.

If anything, ORMs are _less_ "scalable" than SQL because under the hood the
system _is_ based on SQL, and in an effort to be completely generic, the ORM
generates really bad SQL.

~~~
rmaccloy
Indeed, "kids" (by which I assume you mean "assembly-line CS grads") these
days are pretty much using Hibernate, ActiveRecord or (I guess?) LINQ. This is
a good thing: The O/R mismatch exists. For good programmers ORMs _are_ more
productive in most cases (and you can write raw SQL for the exceptions); bad
programmers have _never_ gotten the relational model (and there's plenty of
legacy code to prove it.)

However, the discussion about "kids" is a total red herring, because I'm
pretty sure all of the people involved in talking up Tokyo Cabinet,
memcachedb, Couch etc on one end and Cassandra, Hypertable or HTable on the
other are fully aware of the difference between a class and a table. The truly
clueless don't even know the discussion is happening.

The object/document database revival circa 2009 is about three things:

A. People are _already_ using object-based access for almost everything, so if
they're _not_ using relational features it's trivial to drop in an object
store and get better performance/less overhead.

B. On the low-end, sometimes a relational database _is_ too much overhead --
if not in performance, in administration. (SQLite is cheap, sure, but
sometimes you _want_ just a disk-backed hash table.)

C. On the high-volume data-analytics end, RDBMSs don't perform well enough
unless you shell out serious cash for Oracle or Greenplum, Aster Data,
Teradata, etc., and even then they still can't handle the volumes MapReduce
and Bigtable-alikes can.

~~~
wvenable
A. That is fine until the first time somebody has to write a report.

B. With the right product, administration is easy. MySQL is completely
ubiquitous, and something like SQL Server is easy. This certainly conflicts
with point C.

C. Most of these people aren't doing high-volume anything. If you are doing
high-volume stuff (like Google) then of course it makes sense to use something
very specific to your task. But that doesn't mean Bigtable, for example, makes
for a good _generic_ solution.

~~~
rjurney
I think the key insight in all this is that the same database you store your
objects in... shouldn't be the same one you write reports from. You can have
two. One object store for your application, and another analytic DB for your
reports.

The complaint about SQL is that it's optimized for report writing, not
application development. Fine, so split that out.

Which is happening. Which is working.

~~~
div
Do you happen to know any articles on people splitting their data across two
types of storage this way?

If there is any data that should be stored in 2 separate stores, I can see
applications becoming messy rather fast in trying to replicate changes.

Unless of course this is usually done by having a clean separation between
which data goes where, but I doubt if any domain can ever really be that
easily split up.

~~~
rjurney
Well, for starters - most companies don't run reporting queries on their
production SQL database. They mirror, summarize, partition, index and cube a
separate reporting DB, so that big/mean queries don't cause massive latency on
their site/product/production system. That isn't quite the same as using a
different data store altogether, but some kind of split is commonplace, which
means some kind of difference between queries/applications in reporting and
production is already common. The key here, though, is that setting up an
RDBMS that can handle analytics on even a moderately large data set is a major
task, can be complex/pricey, tends to use big iron, and only scales so far
before it gets very, very expensive.

But, yes there are examples of what I just described. In practice, in many
problem domains, most data of interest for reports does not change once it is
written, so syncing up is not a major issue.

Streamy is a good example, I think. They use HBase for the front end, and run
MapReduce jobs on the back end. [http://wiki.apache.org/hadoop-
data/attachments/HBase(2f)HBas...](http://wiki.apache.org/hadoop-
data/attachments/HBase\(2f\)HBasePresentations/attachments/HBase_Goes_Realtime.pdf)
Another presentation is here: [http://www.docstoc.com/docs/2996433/Hadoop-and-
HBase-vs-RDBM...](http://www.docstoc.com/docs/2996433/Hadoop-and-HBase-vs-
RDBMS) That is Hadoop and HBase, which is nice - but HBase is optimized for
the front end and is fundamentally different from the typical batch operation
of Hadoop.

CouchDB sort of takes this approach, albeit with key/value and pre-defined and
materialized map/reduce views on the same store. I think this dichotomy will
become increasingly common, and will be less cumbersome than it currently is
as the tools mature.

Key/value for the front end and map/reduce on the back end makes a lot of
sense for a lot of problems, since key/value is how many applications actually
work, and there is the added benefit that systems like these scale linearly on
commodity hardware using FOSS, which can make them cost effective - and much
simpler - than scaling a traditional RDBMS as an analytic data store. The
upside is too good for these systems not to win a big chunk of the market. And
you can have your SQL - albeit on top of MapReduce - in reports, where it
belongs :)

------
antirez
At the point where you see the load drop in the following graph, we switched
from MySQL to Redis:

<http://db.lloogg.com/load.dir/load-1month.png>

Alternative DBs are not good for everything, but SQL databases are not good
for everything either...

~~~
sounddust
I'm sorry, but this graph says nothing. Load average is meaningless; for
example, one process using 100% CPU at all times would show a load of 1.0, and
one process that repeatedly spawns threads but uses little cpu can push the
load average very high.

What's the difference in maximum number of pages served per second using ab?
Or average CPU usage? Or average disk throughput?

I'm not saying I disagree with your conclusion, but this doesn't prove it.

~~~
antirez
Got your point, actually; you can see all the data here:
<http://db.lloogg.com>. There are many stats about load, disk I/O, and so on.

~~~
sounddust
That's much more interesting data, but it clearly shows that the reason Redis
was performing better is because it's caching a ton of data in memory, whereas
MySQL was constantly hitting the disk for access. This seems to be a
misconfiguration of either your kernel settings or MySQL settings. If Redis
effectively cached your entire dataset in RAM, then MySQL could have done so
as well (or the kernel could have). I'm guessing that you configured MySQL in
a way that caused the kernel to fight with the database over memory usage.

------
mblakele
This seems oversimplified to me. When I think "database" I think of ACID
properties. You can have SQL without ACID, and you can have ACID properties
without SQL.

~~~
evgen
The problem with this simplification of the problem space is that ACID
properties cannot be maintained (see CAP) when you start moving to a
distributed model. The "no SQL" group is just using the name as a catchy hook
upon which to hang their hat; it is not SQL per se but the traditional RDBMS
that is reaching its limits. It was just hard to make a memorable name that
included "RDBMS" :)

~~~
gaius
_ACID properties cannot be maintained (see CAP) when you start moving to a
distributed model._

You need to pop back to the 1960s and tell IBM that before they wrote TPF.

------
gsteph22
I wrote a fairly detailed post about the Death of the Swiss-Army RDBMS on my
blog, <http://www.roadtofailure.com> :)

The RDBMS has been used for way too long as a generic "bucket" for any sort of
data storage and analytics. As a result, speed and scalability have suffered.
With Social Media and the advent of "web-scale" data, engineers need to
realize that the storage and analysis solutions they use must be driven by the
type of data they're storing, and what they want to do with it.

As I state, "The ACIDy, Transactional, RDBMS doesn't scale, and it needs to be
relegated to the proper dustbin before it does any more damage to engineers
trying to write scalable software."

------
gits_tokyo
Easy in != Easy out ... didn't your mum ever tell you, "there's no such thing
as an overnight success"

\---

Personally, I have absolutely no rush in dropping MySQL as a persistent store.

------
hello_moto
As far as I know, HBase is the software that they should point out in the key-
value discussion, not Hadoop.

------
schammy
This article is the biggest load of fucking bullshit I've ever read. That's
all I'm going to say about it. The author couldn't tell his ass from a hole in
the ground.

~~~
sho
I am glad you are getting modded back up from the grave; I agree and
appreciate someone calling a spade a spade.

We don't always need to get into thesis-length technical arguments about the
merits of so and so viewpoint - sometimes an article is just _bullshit_ and
deserves to be dismissed as such.

~~~
sho
Quite a turnaround. The GP was at -2 when I wrote the above.

HN is a fickle mistress indeed!

------
Devilboy
I'm not convinced. Of course these projects are useful for some applications
but most online services will never grow to the size where you HAVE to dump
your SQL server for something more scalable. And in the meanwhile SQL is just
sooo convenient and easy to develop on. Why would I want to switch unless I
have to?

~~~
gnaritas
> And in the meanwhile SQL is just sooo convenient and easy to develop on.

No, it isn't, it's a pain in the ass because it doesn't match how applications
are programmed. I know SQL pretty damn well, and I know how to tune a database
very well, and SQL still sucks.

~~~
dasil003
It has its pain points, to be sure, but the fact that "it doesn't match how
applications are programmed" is because it's modeling data, not code. Yes,
your object store will make the application much easier to write initially.
But then what happens when you need reporting, or worse, some feature that you
never anticipated that requires a different view into the data?

That's not to say SQL is the end-all-be-all, but people are so quick to
dismiss it, and I bet there is going to be _a lot_ of pain felt as code based
on simpler databases ages.

~~~
gnaritas
> But then what happens when you need reporting

Then I replicate the application's data out into a relational database for
reporting. 99% of what applications need isn't reporting. I want to use many
different data structures, not be forced to fake everything with a table, or
continually reassemble things from tables.

SQL has its strong points, being a convenient data store for applications
isn't one of them.

~~~
dasil003
Again, it's not convenient for greenfield, but applications don't spend most
of their time in that state. A good ORM is a compromise between holding onto
that flexibility and being able to reconstruct your data objects easily.

I realize that there are a lot of applications where you will never need that
flexibility and a simpler database would be the right choice. My attitude is
a response to the unbelievable hype surrounding these technologies among
a certain class of tech blogger. Given how quick many people seem to be
willing to chuck the relational model out the window without acknowledging any
of its benefits, I predict there will be a lot of rude awakenings and failed
projects in the next 10-20 years.

~~~
gnaritas
> I predict there will be a lot of rude awakenings and failed projects in the
> next 10-20 years.

As opposed to the many, many failed or stuck projects out there now because
they made the mistake of using a relational database as an integration point
between many applications and can't risk changing anything now because that
one big global variable (the db) can't be touched without breaking 15
applications.

The relational folk don't like to admit their shortcomings either.

------
c00p3r
No to SQLServer and Oracle. =)

