
Dear NoSQL: "SQL-isn't-scalable" is a lie - portman
http://www.yafla.com/dforbes/Getting_Real_about_NoSQL_and_the_SQL_Isnt_Scalable_Lie/
======
JunkDNA
This is a very reasoned, well-written article. While I like the idea that the
NoSQL movement is questioning some core database assumptions, I have often
been uncomfortable with the meme that "SQL doesn't scale". Scalability is
always a function of your circumstance. This article makes a good point that
the group of people for which SQL doesn't scale need another option (internet
startups with little cash and massive server load). While that case gets a lot
of attention around here, it's important that people not try to extrapolate
that experience to other areas of IT.

~~~
evgen
What the NoSQL "movement" is doing is questioning some basic axioms of
architecture and operations that may no longer hold in the modern era. There
are many aspects to data: how it is used, how it should be stored & accessed,
and what is or is not important within the set of available data. The RDBMS
cabal set a particular set of standards (ACID) that were a good match for a
lot of the early "big data" systems and the computing infrastructure of the
time, but times change. As more people bring new data sets and new access
patterns for this data, and as the computing infrastructure shifts from large
servers to swarms of small systems, the options for storing and accessing
this data are also changing. I don't know many in the NoSQL crowd who are
actually opposed to an RDBMS as a solution, but it is not necessarily the
first choice for every problem; most of the people in this group seem to
start by considering the data and the application, and then pick and choose
the characteristics of the data system that will be used to meet those needs.
The massive and highly distributed systems get a lot of attention (and they
are almost always NoSQL systems), but that does not mean that alternative
data systems have no place up and down the IT stack, nor does it mean that an
RDBMS is not a good option for some situations.

~~~
WorkerBee
_The RDBMS cabal set a particular set of standards (ACID) that were a good
match for a lot of the early "big data" systems and the computing
infrastructure of the time, but times change_

Times change, but in the sense of adding additional systems with new
requirements.

The existing systems have not gone away, and are not going anywhere. In fact,
there are more of them around than ever. I certainly still want my bank to
run a system that has ACID transactions. Likewise my mobile phone's billing
system, utilities bill, etc.

------
keithwarren
I wonder how many people in the NoSQL and "SQL doesn't scale" crowd have
either never met a truly competent, much less good, DBA (trust me, they are
very, very rare), or decided it could not scale because they applied their
programmatic and procedural logic to a tool that operates in a very different
(set-based) paradigm.

~~~
waterlesscloud
I've always thought good DBAs were the rarest thing in the industry. And the
most valuable.

~~~
evgen
"Good" DBAs share digs with santa and the tooth fairy, but competent and
opinionated DBAs are not as hard to find as you might think. Just whisper
'NoSQL' and they seem to crawl out of the woodwork :)

------
wooster
"Such a platform can yield very satisfactory performance for tens or hundreds
of thousands of active users"

There are 253 million Internet users in China alone. What happens when your
site needs to scale from 0.001% of them using it simultaneously to 1% of them
using it simultaneously? Within a month?

"Of course if you index poorly or create some horrendous joins"

Which in the Twitter and Facebook cases is exactly what they have to do on
many of their requests. As I've personally found out, relying on a database to
do a join across a social network graph is a recipe for disaster. One day
you'll be woken up because your database's query planner decided to switch
from using a hash join to doing a full table scan against 10s of millions of
users on every request. Then, you'll be left either trying to tweak the query
plan back to working order, or actually doing what you should've done in the
first place: architect around message queues and custom daemons more suited to
your query load.
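
To make that concrete, here's a toy sqlite sketch (table, data, and user ids
invented for illustration) of the kind of social-graph self-join being
described; each extra hop of the graph is one more self-join, and at tens of
millions of rows the planner's choice between an indexed join and a full scan
on exactly this query shape is what can flip overnight:

```python
import sqlite3

# Invented edge table: one row per "follower -> followee" relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follows (follower INTEGER, followee INTEGER);
    CREATE INDEX idx_follower ON follows(follower);
""")
conn.executemany("INSERT INTO follows VALUES (?, ?)",
                 [(1, 2), (2, 3), (2, 4), (1, 5), (5, 6)])

# Friends-of-friends for user 1: a single self-join. Deeper traversals
# need one more join per hop, which is what strains an RDBMS planner.
rows = conn.execute("""
    SELECT DISTINCT b.followee
    FROM follows AS a JOIN follows AS b ON a.followee = b.follower
    WHERE a.follower = 1
""").fetchall()
print(sorted(r[0] for r in rows))  # [3, 4, 6]
```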

"Even with billions upon billions of help tickets."

At 50 million tweets a day, Twitter would hit 18 billion tweets within a year.
Good luck architecting a database system to handle that kind of load. That is,
one in which the database system is serving all of the requests (including
Twitter streams) and isn't just being used for data warehousing.

"Such a solution — even on a stodgy old RDBMS — is scalable far beyond any
real world need"

The disconnect here is this guy's needs are not the needs of a lot of us who
are actively looking at alternatives. He is simply not familiar with the
problem domain.

~~~
donw
I don't think the original poster was making the argument that Twitter should
run fine on a SQL database; in fact, I think he indicated the opposite:
namely, that large, nominally non-relational datasets that can afford to lose
a little data here and there, or at the very least take a while to save it,
are really what you need for serving up a big, fresh pile o' Social
Networking.

So, bringing up Twitter or Facebook really doesn't make a good case against
RDBMSes as a good tool in the toolbox -- they've got a very unique set of
needs that don't apply to a lot of the rest of the world. So, of course, SQL
isn't the best solution when you're dealing with trillions of rows of data,
and don't really want to spend hundreds of millions of dollars on the
infrastructure required to guarantee that you never go down, and never lose a
tweet.

And keep in mind, an RDBMS helped them get to the point where they could
enjoy these problems; Twitter probably wouldn't exist in all its current
glory if they had spent a year building it to be 'scalable' before launching.

I think the reason that a lot of people end up hating RDBMS and SQL is because
of one-or-more of (a) their only experience is with MySQL, which really isn't
that awesome; (b) they've been burned by bad schema design; or, (c) they don't
really get relational algebra or set theory.

For an example of 'bad schema design', I once worked at a company that had
indices on nearly every column of their DB, even though almost none of these
ever got queried. There was one database table with _five_ indices on three
columns, and of course this was the table that logged _every single HTTP
request_ processed by the front end. Including API calls. Did I mention that
this table was never queried by any part of the application?

It was a poor design decision, and sure enough, it completely torpedoed
performance. But the problem wasn't the RDBMS, because it did exactly what it
was told to do, no matter how asinine.

So, in short, RDBMS aren't the solution to all problems, but they do solve a
lot of problems adequately. NoSQL databases also serve an important role in
the toolbox, but are much more narrowly-focused.

~~~
wooster
You make some good points.

However, the original author's point basically boiled down to: if you define
scalability as the problems you can scale an RDBMS to solve, RDBMS systems are
scalable. I'm not big on arguing the finer points of someone's tautology.

The particulars of a situation determine the scalability of a solution. For a
lot of us working at web scale or on interesting new problems, an RDBMS won't
scale. Sometimes it won't scale within the constraints we have, but sometimes
it won't scale because we won't be able to build the system we're trying to
build. His example of a company-internal billing system really only served to
highlight the disconnect between the crowd following along well-trod ground,
and the people out front doing innovative work.

~~~
lucifer
_"[SQL is ideal for when] "Data consistency and reliability is a primary
concern"."_

I'm curious: let's say you have a Twitter-scale app that must satisfy
consistency and reliability as a primary requirement. Is there really a NoSQL
solution that can take you there without (effectively) raising the costs to
the point that a scalable (money-no-object) SQL solution would provide?
(Kinda like how the difficult-to-extract North Sea oil became economically
viable once oil prices rose through a certain ceiling?)

~~~
wmf
The essence of NoSQL is that it gets its scalability by giving up consistency
and reliability. Trying to run NYSE or Visa on NoSQL is pointless.

~~~
evgen
Well, "reliability" can be sliced a couple of different ways, since that term
can cover both the A & P in the CAP options, and it can also mean the
elimination of single points of failure and an architecture that degrades
gracefully when components fail. Some NoSQL systems let you select the mix of
consistency and reliability you need at a rather fine-grained level -- one
thing that does distinguish these systems from the traditional RDBMS is that
you are almost never in an all-or-nothing situation regarding any particular
part of the data space unless you explicitly want to create that choice to
enable other options.

------
nostrademons
I thought his example was really telling.

Say that his internal help ticket tracking system was built for IBM, one of
the largest corporations out there with 300,000 employees. 300k users is
_tiny_ for a consumer app. We had more than that when I was volunteering for a
Harry Potter fanfiction website. Even if he was working at the largest company
on earth by employee count (Wal-Mart), he'd still have fewer users than we had
for Harry Potter fanfiction. And usually employees don't submit more than one
or two help tickets a day, while Harry Potter fans tend to view a forum thread
every minute or so.

It really hits home how consumer data processing has changed the game for data
management. When I was working in the financial industry, we dealt with about
50GB of data/day coming off the exchanges. I thought that was a lot. But at
Google, there's _terabytes_ per day - at least two orders of magnitude more -
and the total volume of financial transactions is basically rounding error on
the data we handle.

It makes sense that with this exponential explosion in data, we'd need
different techniques to handle it. Quite likely, RDBMSes do scale for the
scale he's talking about. But a bunch of industries have opened up within the
last ten years that require several orders of magnitude more data, and it's
naive to think that just because it works for a corporate help desk or POS
system, it'll work for a system that logs every page view and every action for
millions of users.

~~~
ergo98
>300k users is tiny for a consumer app.

Active users, not the total number of users in the user table. People grossly
overestimate the scale of most web properties, where the number of active
users is far lower than you likely imagine.

>But a bunch of industries have opened up within the last ten years that
require several orders of magnitude more data, and it's naive to think that
just because it works for a corporate help desk or POS system, it'll work for
a system that logs every page view and every action for millions of users.

Strangely it says nothing of the sort. Yet here, again, you've used Google as
the example. How many Googles are there? How much does that apply to about
99.999% of people who deal with databases?

Yet it always appears as the example.

~~~
petewarden
> How many Googles are there?

There are tens of thousands of alternative search engines out there; see this
for a sample from '07:

[http://www.altsearchengines.com/2007/10/29/the-top-10-list-of-search-engine-lists/](http://www.altsearchengines.com/2007/10/29/the-top-10-list-of-search-engine-lists/)

Most of them fail, but there are a lot of people trying to solve Big Data
problems on a shoestring budget. That's part of the disconnect: us NoSQL
folks are excited by having cheap solutions to problems we wouldn't be able
to afford to tackle otherwise, whilst SQL folks are shaking their heads at
the mess we'll have to clean up if our prototypes do become successful.

------
warfangle
While SQL can scale, I think this argument is a little spurious.

I truly dislike the NoSQL stance of "never SQL;" it has its place, and its
place isn't necessarily at the twitters or the facebooks of today. SQL scales
very well with datasets that make sense for an RDBMS: CRUD-style
applications, core business apps, data that doesn't necessarily need to be
mined furiously. Trying to shoehorn, say, a high-volume message system or an
intensely self-referential (graph) dataset into an RDBMS is a recipe for
disaster, however. Many of the performance issues people see with RDBMSes
seem to stem from this, I believe.

If your app is hugely real-time and data-driven, and the data mining (if
necessary) can be offloaded to cron jobs, a K/V store is great. And it can
scale very quickly and relatively cheaply.

If you're doing something hugely relational that, if loaded into a SQL
server, would require an immense number of self-joins (I'm looking at you,
graph analysis), and that must be done in real-time rather than offloaded to
a cron job, a graph database is probably the way to go. They're harder to
scale to a huge amount of data, but certain data mining tasks become much
easier - and don't require distributed map-reduce execution. Scaling will get
much easier once systems come out that use k-means (or similar) clustering to
shard the data. That kind of smart scaling would be nigh impossible on either
a KV store or a traditional RDBMS. Google gets away with it with BigTable
because they can throw so much cheap iron at it - a truly brute-force
solution. The same solution you need with an RDBMS when you shoehorn in
datasets that don't make sense for it.

Emil Eifrem (Neo4J) said it best in his presentation
([http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome](http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome)):
NoSQL doesn't mean Never SQL, it just means Not Only SQL.

------
kennu
I don't think NoSQL people usually claim that SQL isn't scalable, just that
it's unnecessarily complicated to scale.

You generally have to partition your data horizontally and thus give up many
of the features that SQL has to offer: ACID transactions, unique keys, auto-
increment primary keys, etc.

Then you have to come up with your own solutions to replace those features:
eventual consistency, UUID keys, map/reduce, etc. And these happen to be
exactly the kind of features that many NoSQL databases can give you out-of-
the-box.
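
As a minimal sketch of that trade (shard count, routing, and storage are all
invented here, and real systems add replication and rebalancing on top):
auto-increment ids can't be issued safely across shards, so each row gets a
coordination-free UUID key, and a hash of the key picks its shard.

```python
import uuid

NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}  # stand-ins for real servers

def new_key():
    # UUIDs replace auto-increment: globally unique, no central counter.
    return uuid.uuid4().hex

def shard_for(key):
    # Hash-style routing: the key alone determines the owning shard.
    return int(key, 16) % NUM_SHARDS

def put(row):
    key = new_key()
    shards[shard_for(key)][key] = row
    return key

def get(key):
    return shards[shard_for(key)].get(key)

key = put({"user": "alice"})
print(get(key))  # {'user': 'alice'}
```

Cross-shard uniqueness constraints and joins are exactly what this scheme
gives up, which is the complication being pointed at.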

~~~
wmf
_You generally have to partition your data horizontally and thus give up many
of the features that SQL has to offer_

There are plenty of databases that will partition data without giving up any
SQL features, but they cost money.

~~~
jbellis
> There are plenty of databases that will partition data without giving up any
> SQL features, but they cost money.

They also either rely on a single huge SAN for storage (single point of
failure + expensive as hell) like Oracle RAC, or they require specialized gear
like infiniband to reduce intra-node latency like Exadata (starting price:
seven figures) or they're analytics databases that are designed for huge
queries with latencies to match like Vertica, ParAccel, etc. (Think minutes
between data being loaded and being available to query.)

I'll take NoSQL, thanks.

~~~
neilc
Why would the need for "specialized gear like infiniband to reduce intra-
node latency" be limited to parallel databases? (I assume you mean "inter-
node"...)

~~~
wmf
This whole discussion is about parallel databases since that's the only way to
scale beyond the performance of one machine.

~~~
neilc
Well, replace "parallel databases" with your favorite term for the parallel
databases that fall outside NoSQL (VoltDB, Exadata, sharded MySQL, etc.). My
point is that the alleged need for high-speed interconnects is orthogonal to
SQL vs. NoSQL.

~~~
jbellis
But it's not. SQL databases (strictly speaking, any requiring strong
consistency... which is mostly RDBMSes) are highly latency-sensitive, whereas
NoSQL databases like Cassandra design around that by saying "hey, you might
not see the most recent write for a few ms, unless you request a higher
consistency level." And most apps are fine with that. As a bonus you get
multi-datacenter replication with basically the same code, another place most
RDBMSes are weak.

It's a classic design hack -- redefining your goal as an easier problem.
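
The tunable-consistency idea can be sketched with a toy model (this is not
Cassandra's implementation; the replica list, timestamps, and routing are
invented): with N replicas, a write acknowledged by W of them and a read
consulting R of them are guaranteed to overlap on a fresh copy whenever
R + W > N, and anything less admits the stale reads described above.

```python
# Toy model of tunable consistency. Cassandra exposes the choice of R and
# W as consistency levels (ONE, QUORUM, ALL).
N = 3
replicas = [{"value": None, "ts": 0} for _ in range(N)]

def write(value, ts, w):
    # Only w replicas acknowledge synchronously; the rest lag behind,
    # which is where "you might not see the most recent write" comes from.
    for rep in replicas[:w]:
        rep["value"], rep["ts"] = value, ts

def read(indices):
    # Return the newest value among the consulted replicas.
    return max((replicas[i] for i in indices), key=lambda r: r["ts"])["value"]

write("v1", ts=1, w=3)   # fully replicated
write("v2", ts=2, w=2)   # W=2: replica 2 still holds v1
print(read([2]))         # R=1, R+W=3 (not > N): stale read, "v1"
print(read([1, 2]))      # R=2, R+W=4 > N: guaranteed fresh, "v2"
```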

------
Aleran
Was anybody else dumbstruck by the article's first comment by this bright
chap, "Jeff"?

 _Brilliant. I've forwarded this to my team._

 _We make a tax solution and I've been dealing with vague "we should use
NoSQL" comments from a few of the less capable members of the team._

If his team members read all the way to the comments it's going to be very
awkward tomorrow at work.

~~~
joe_the_user
Well if some of the dumbshits I work with at the White House read this I'll be
in trouble too but I don't think that's going to happen.

Care to buy some classified material?

------
viraptor
Reading this article was a bit annoying. Right from the start: "I work in the
financial industry [...] I worked in the insurance, telecommunication, and
power generation industries." All I could think was: you're not even supposed
to look at NoSQL from that perspective -- there's nothing for you there...
just go away.

It doesn't support transactions and ACID most of the time. There's no "we pay
$xxxM for the support and blame you for everything" company in NoSQL
products. It's not the same workload as you'd expect from a kv/document-store
used as a webpage backend.

One of the few serious "nosql" databases for enterprises like that is
Berkeley DB - it's got what they need. I'm not sure why he wrote that blog
post... it just stated the obvious, but in the form of a rant.

The funny thing though is - Berkeley DB is exactly what NoSQL is about... and
it is used for local reliable storage in many big enterprises. Replication,
logging, transactions, etc. - and it's just a kv-store really.

~~~
wvenable
I'm not sure why, these days, anyone would choose Berkeley DB over something
like sqlite.

~~~
viraptor
BDB has proper transactions, provides the same type-safety as sqlite (i.e.
none), doesn't have to go through abstractions like JDBC (overkill for
sqlite), is backed by Oracle, has a pure Java implementation, and has
replication.

Sqlite has columns and can search based on them. It can also save you a
couple of lines on manual joins. Does it provide anything more?

(although tbh, I'd take Tokyo Tyrant over both of them - has columns, writer
lock + server-side scripts instead of transactions, same model of replication
as BDB)

~~~
Erwin
The Berkeley DB interfaces that come out of the box with e.g. Python don't
have any of the advanced features that the sqlite interface has, e.g.
concurrent access. Any advanced usage of Berkeley DB is far more complex than
sqlite; e.g. try opening a database so that multiple users can read/write it.
With sqlite, no extra steps are necessary if you have an occasional separate
process that needs to access the database (obviously it's not built for
efficient concurrent access).

Ad-hoc SQL queries on the sqlite database are also a huge win. Much better
than defining your own data structures and then tools to read/write them.

I don't know what you mean by "proper" transactions. Sqlite has transactions
for DML.
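
That last point is easy to demonstrate with Python's built-in sqlite
interface (table and values invented for illustration): a failed statement
inside a transaction rolls back the other DML in the same transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    # The connection context manager wraps a transaction: commit on
    # success, rollback if the block raises.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 60 "
                     "WHERE name = 'a'")
        conn.execute("INSERT INTO accounts VALUES ('a', 0)")  # PK violation
except sqlite3.IntegrityError:
    pass

# The debit was rolled back together with the failed insert.
print(conn.execute("SELECT balance FROM accounts WHERE name = 'a'")
      .fetchone()[0])  # 100
```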

------
citatus
I wish that people would understand that people who disagree with them are not
'liars' but are instead people who disagree with them.

It is possible to honestly disagree on a technical issue, you know. And it is
good professional and personal practice to only accuse someone of lying when
you are quite sure that they are deliberately telling you something they know
to be false...

------
petercooper
I like the article, but the first comment left on it is telling:

 _I've been dealing with vague "we should use NoSQL" comments from a few of
the less capable members of the team._

This appears to be a common attitude among development managers in the
corporate world: that anyone who starts suggesting anything vaguely "new
fangled" is surely a naïve novice, rather than someone good at picking up and
investigating new technologies.

~~~
wvenable
Unfortunately, managers who aren't defensive about new technologies end up
working with XML databases! Not all new technologies are better and certainly
the benefits of NoSQL are up for debate.

~~~
petercooper
XML databases are exactly the sort of glossy new technology that stuck-in-
the-mud managers _did_ pick up on, mostly because they respond to glossy
vendor pamphlets, sales calls, and trade show pitches more than the
grassroots findings of their underlings.

------
evgen
My favorite part of the rant is the suggestion that the nosql alternative to
spending millions on a ginormous RDBMS was throwing away throughput by using
Amazon AWS. I guess any argument becomes easier to make when you define the
other side by its incompetence.

The only valid point made seems to be that vertical scaling of a RDBMS can be
a multi-million dollar exercise.

~~~
wmf
Unfortunately, what you call incompetence a lot of other people call best
practice.

~~~
evgen
If throughput and transaction/analysis speed are requirements then AWS is not
the answer -- anyone who suggests otherwise has never used it for anything
larger than a toy dataset. I am currently migrating a large data analysis
system (20+ TB running over about 500 EC2 hadoop workers) to a dedicated
cluster because the internal EC2 latency reached a tipping point in our
analysis runs. If you have the dough to spend on a big vertical system you
can spend half as much on a dedicated cluster running a NoSQL solution and
probably meet the required spec. The original article was comparing a Porsche
Cayenne to a bicycle; the bicycle works for some problems, and in certain
cases it (or a fleet of them) can solve the problem better, but it was a
dishonest strawman comparison for the subject at hand.

------
steveklabnik
I was really with this article, right up to this:

> If you lose a Status Update, or several thousand of them, it will likely go
> unnoticed.

What? If Facebook lost half of their photos, or if Twitter lost a few
thousand tweets, there'd be riots in the streets. Okay, maybe not quite that
much unrest, but still.

~~~
wvenable
I've run a few medium sized sites (with traffic most people here would drool
over) and I would say that people are much more forgiving of slow pages than
lost data. Losing a few forum posts would cause riots in the streets.

------
rmorrison
I'm curious how services like Amazon's RDS will change this perception.

A SQL database may be difficult to scale, but it is something that can be
largely encapsulated and outsourced. If Amazon RDS, or some other product,
handles the hardware and software configuration, then the developers can just
focus on the application portion of it.

This isn't to say that scalability is guaranteed; it's still important to
optimize queries and the data structures. Also, there are problems where
NoSQL is simply the better and/or cheaper option.

But if these services can encapsulate a lot of the difficult part of scaling
SQL, it still makes SQL a very attractive and powerful option for most(?)
problems.

~~~
wmf
RDS doesn't scale at all. The performance of RDS won't exceed that of MySQL
running on a "Quadruple Extra Large" instance (because that's what RDS is).

~~~
rmorrison
Ok, but how about a service that takes whatever scalable SQL systems the
financial or pharmaceutical industries use, even Oracle, Postgres, or
Microsoft SQL Server, and sells it as a service similar to RDS?

My point is, I think there is an opportunity providing enterprise-level SQL
scalability on a per-use basis. It won't replace NoSQL systems, because there
are some problems where they're clearly better, but it could be done and
provide relatively-affordable, scalable SQL access to startups.

~~~
wmf
Enterprisey databases are just too expensive for the HN crowd; making them
into a service can't fix that.

Azure might come close, but I don't know what performance it can provide.

~~~
gaius
Azure doesn't run a lot of SQL Server functionality.

------
richcollins
One complaint that I have with SQL databases that you don't often hear
elsewhere is that they are very complex. People often use them where a much
simpler solution would suffice because they are what people are used to using.
All else equal, we should choose the simple solution over the complex one,
because it is less likely to fail and easier to extend.

------
gsteph22
The word "butthurt" comes to mind when thinking of this article. ;D

------
est
maybe SQL scales; ACID probably doesn't

~~~
donw
I would wager that SQL, and ACID, will scale larger than many startups will
ever actually need.

~~~
megaduck
True, but my sincere hope is to outgrow Postgres someday.

------
pw0ncakes
SQL has, by modern standards, a shitty API. The DSL is an ugly mess and it's
extremely difficult to reason about performance, especially when joins are
involved.

Ergo, some not-by-choice SQL users only scratch the surface and use common
features: INSERT, UPDATE, DELETE, SELECT, also known as CRUD.

Ergo, it becomes easy (although wrong) to conclude that SQL isn't doing very
complicated work behind the scenes and that it's just another overcomplicated
dinosaur POS like the Windows operating system, one that remains popular only
because it's a standard.

Thus, a lot of undue hate gets directed at SQL with little attention paid to
the subtleties of what it does extremely well and where it behaves poorly.

I think "SQL isn't scalable" is in the same league as "Java isn't concurrent".
It can be, if you have learned the necessary skills and are willing to deal
with a bit of pain. Is Clojure astronomically better for concurrency? Sure,
but people can and do scale with SQL databases, and they do this in large part
because there are problems for which SQL is the appropriate solution.

~~~
derefr
Why are SQL and RDBMSes conflated? Couldn't one database speak both SQL and
something else if it wanted to? In the same way that Clojure and Java both end
up as JVM bytecode, couldn't SQL just become another layer on the encoding
stack, above something simpler?

------
jbellis
tl;dr: "sql does so scale! if you throw $millions of hardware at it." yawn.

~~~
mmt
The article points out that it has _already_ scaled with hundreds of
kilobucks or even millions.

However, it _also_ points out much lower-end hardware solutions that cost
under $10k but perform much better than the largest EC2 instance, for I/O.

ETA: This is why I tend to roll my eyes at the notion of "commodity"
hardware. The article's low-end array is 400MB/s, but rolling one's own can
yield over twice that for the same or lower price tag. All this _well_ before
reaching the unscalable cliff of enterprise pricing.

~~~
illumin8
Yes, but there's a big difference between a supported 400MB/sec storage array
for < $10K, and a garage-built "roll your own" storage array that can do
800MB/sec for half the price.

The main difference is that, even in a startup company, you have a supported
solution and can call someone at 2 am when your storage dies and expect to
find replacement parts and support. Good luck trying to drive to Fry's and
buy replacement hard drives for a server that was built 2 years ago by
someone who no longer works there.

~~~
evgen
Yeah, right. I have worked on storage systems that run into the tens of
megabucks, and even with pricey "4 hour" support we were often SOL when we
actually needed the company to live up to its claims. The difference with the
garage-built system is that I sometimes have components sitting around my
desk, or that can be pillaged from a VP's desktop, to repair the system --
try doing that when the specialized disk controller on your gold-plated
solution goes tits-up.

~~~
illumin8
Totally incorrect. Ok, go ahead and build me a home-built storage array with
1TB SATA drives off the shelf. Then, 3 years from now, when one of your
drives fails and you don't have any spares, try to find a new one that
matches the exact geometry of the existing one.

What's that? You can't buy that exact drive so now your homemade RAID 5 is
running in degraded mode and you hope it will stay up long enough to copy your
data off onto another system? Sucks to be you, you tried to save a few bucks
and got burned.

In the enterprise, we pay big bucks because we want to KNOW that we can call
an 800 number and get an exact replacement hard drive, even if they stopped
selling them 3 years ago.

~~~
mmt
_Then, 3 years from now, when one of your drives fails and you don't have any
spares, try to find a new one that matches the exact geometry of the existing
one._

With arrays I build, I don't have onerous constraints like requiring
identical size[1]. Moreover, if I'm not already retiring disks at the 3-year
mark, I'm very much remiss in my duties.

_Sucks to be you, you tried to save a few bucks and got burned._

We're not talking about a few bucks. We're talking hundreds of thousands of
dollars. That's enough to pay a salary for those 3 years as well as to have
replaced it with something less than a couple of generations old.

[1] I assume that's what you mean, since true geometry is all but impossible
to detect on modern drives.

