
NoSQL vs. RDBMS: Let the flames begin - sstrudeau
http://stu.mp/2010/03/nosql-vs-rdbms-let-the-flames-begin.html
======
JunkDNA
I completely do not understand this sort of thinking. It's like saying, "All
you people driving motor vehicles: trains are better". It's a meaningless
comparison. NoSQL systems are really great for certain things. RDBMS systems
are really great for certain things. What I would _really_ like to see is if
all this effort writing these kinds of articles went into well-thought-out
pieces that discuss very specific cases where NoSQL backends provide an
advantage and why. Use cases that are not a blog, digg, or twitter clone would be most
helpful. Some of us have to work outside the bay area for companies like
banks, insurance companies, hospitals, etc... Can I use Cassandra as a data
warehouse for electronic medical records? Hell if I know, without actually
having to learn it and implement it to see if it works.

~~~
siculars
So my thinking on this is that the way these NoSQL systems will make their way
into Healthcare is through the great backdoor of analytics. Virtually all
EHR/EMR systems available today allow various mechanisms for data
retrieval based on primary indices. Unfortunately that avenue does not lend
itself to secondary data reuse which will become more and more valuable as
Institutions realize their data is valuable not only for academic research but
for operational efficiencies re. money in the kitty.

Even more unfortunate still is that virtually all these installed systems do
not have the performance capacity or advanced search-ability to adequately
mine this growing hoard of data. Administrators, under capex constraints, do
not allocate resources for secondary systems which would duplicate data for
mining purposes while alleviating strain on the principal production system.
The no-money budget problem will lead in-house programmers to build these
research systems on top of open source NoSQL solutions. There, the technology
will prove itself.

Additionally, "NoSQL" comes in different flavors. Generally, all of them forgo
the Consistency in CAP for Availability and Partition tolerance, which is fine
for many use cases - just not primary medical data acquisition use cases. As
the field matures programmers and system designers will learn how to make this
work better to the point where one day NoSQL systems may be used as the
primary data repository for medical data. However, that day has not come. For
instance, Riak allows you to tweak knobs in order to favor certain aspects
of the CAP theorem at different times while in production (specifically the w and
dw parameters, [http://blog.basho.com/2010/03/19/schema-design-in-riak---
int...](http://blog.basho.com/2010/03/19/schema-design-in-riak---
introduction/)). But having just started working with Riak in the last month
or two I would still only use it as an analytics tool exposing my medical
record data to m/r jobs at this point. And before jbellis smacks me, I think
Cassandra is awesome and I'm looking forward to spending some time with it but
I'm still not putting my med app data in Cassandra just yet as a primary data
store.
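
To make the knob-tweaking concrete, here is a minimal sketch of the usual
Dynamo-style quorum arithmetic. It is not Riak's API: `n`, `r`, and `w` are
illustrative names for the replica count and for how many replicas must answer
a read or acknowledge a write, and the overlap rule `r + w > n` is the standard
textbook condition, not something taken from the linked post.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Return True when every read quorum must intersect every write quorum.

    n: replicas per key, r: replicas that must answer a read,
    w: replicas that must acknowledge a write (all hypothetical values).
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


# Favor consistency: majority reads and writes always share a replica.
print(quorums_overlap(n=3, r=2, w=2))  # True
# Favor availability and latency: one ack is enough, but reads may be stale.
print(quorums_overlap(n=3, r=1, w=1))  # False
```

Lowering `w` buys write availability at the price of possibly stale reads,
which is roughly the trade I would not make for primary medical data.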

/Disclaimer. I work for a major University Medical Center and write business
web applications in this area./

~~~
JunkDNA
As a developer of a small, in-house specialized data warehouse for medical
research purposes (built with 100% open source), I would be curious to hear
more thoughts on this. The flexibility angle is certainly a huge win. My issue
is that I can't quite get my head around how one runs ad-hoc queries like
"give me everyone with a bmi >x and age between a and b". In the key/value
view of the world, that doesn't quite fit.
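
A rough sketch of why it doesn't fit, assuming a plain in-memory dict stands
in for the key/value store and the record fields (`bmi`, `age`) are made up
for illustration: with lookups only by key, a query on values degenerates into
scanning everything, where SQL would express it as a single WHERE clause.

```python
# Hypothetical patient records keyed by ID, as a key/value store would hold them.
store = {
    "p1": {"bmi": 31.2, "age": 54},
    "p2": {"bmi": 22.8, "age": 47},
    "p3": {"bmi": 35.0, "age": 29},
    "p4": {"bmi": 33.1, "age": 61},
}


def find_patients(store, min_bmi, min_age, max_age):
    """Full scan: the store only looks up by key, so every value is inspected."""
    return sorted(
        key
        for key, rec in store.items()
        if rec["bmi"] > min_bmi and min_age <= rec["age"] <= max_age
    )


print(find_patients(store, min_bmi=30, min_age=40, max_age=65))  # ['p1', 'p4']
```

On a real distributed store the same scan would typically be a map/reduce job
over every node, which is fine for batch analytics and painful for interactive
ad-hoc questions.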

~~~
egroen
Take a look at the revolutionary illuminate Correlation database - ultimate
flexibility for "ad-hoc" queries

The Correlation database is a NoSQL database management system (DBMS) that is
data model independent and designed to efficiently handle unplanned, ad hoc
queries in an analytical system environment. Unlike relational database
management systems or column-oriented databases, a correlation database uses a
value-based storage (VBS) architecture in which each unique data value is
stored only once and an auto-generated indexing system maintains the context
for all values. Queries are performed using natural language instead of SQL.

Learn more at: www.datainnovationsgroup.com

------
antirez
I don't want to get into the details of the post, but it is impossible for
me to avoid reacting to this: "Do you honestly think that the PhDs at
Google, Amazon, Twitter, Digg, and Facebook created Cassandra, BigTable,
Dynamo, etc. when they could have just used a RDBMS instead?"

It is really impossible to argue for something based on the fact that people
that are supposed to be very smart are doing it. The only way to support
arguments is by showing facts...

Shameless plug, as I don't want to post a separate message just to say this,
but isn't HN too slow lately? I'm at the point where I visit the site less
often than I used to, as I don't want to experience the delay.

~~~
aphyr
I think in some cases this kind of appeal to authority _can_ be valid.

Facebook has absolutely insane sparse matrices to handle. They handle
enormous volumes of traffic querying very specific (read: not cacheable
between users) datasets. Moreover, they've already invested mind-boggling
amounts of capital into their stack. Same goes for Amazon with Dynamo. These
people operate on scales that startups like us can't even comprehend; and
they've found it worthwhile to write their own datastores for those scenarios.
Moreover, their use of those databases has apparently contributed to their
success. That, to me, is strongly suggestive evidence.

That and HA/fault-tolerance is a no-brainer; Cassandra's scaling
characteristics rock the socks off of _any_ SQL DB I've used. The consistency
tradeoff is well worth it for some use cases.

~~~
wvenable
Absolutely. Facebook. Google. All great examples of the need for different
solutions. But I'm not sure about Digg. It seems like a very straightforward
implementation would work for them. But given the small amount of information
they've provided about their setup, it doesn't sound like they've ever gone
for one.

Compare them to StackOverflow, which, on recent evidence, has about 10% of the
traffic of Digg. They're running a very straightforward RDBMS configuration
on rather pedestrian hardware. If Digg has a 50-node cluster (for example),
StackOverflow should require at least a 5-node cluster.

~~~
aphyr
Yeah, I'm a little surprised that Digg has moved away for performance reasons.
Maybe their data model is fundamentally more complex than StackOverflow? Or
maybe SO has a better caching layer in front of the service?

~~~
mbreese
Or maybe Digg's model is flawed. I don't know if it is or not, but from
everything I'd read it was far from optimal. I'd love to see more about it
though. Now, relationship graph traversal is an issue for normalized
relational systems, but in these cases, things could be split pretty cleanly
into articles versus recommendations.

One big problem I see in these comparisons is when a NoSQL person claims that
their box is processing 5000 req/sec, what does that mean? Are they
denormalizing this so much that it's equivalent to 500-1000 req/sec on a
RDBMS?

Another thing: when Digg was starting their type of site was very novel. There
wasn't much out there that approached the scale and growth they experienced.
I'm sure that StackOverflow has been designed with scaling in mind.

------
meroliph
"Let’s say you have 10 monster DB servers and 1 DBA; you’re looking at about
$500,000 in database costs."

I wonder what he thinks is a "monster db server", and considering he included
the DBA in the price, is this the price per year or what?

Having recently set up a dual E5620 with 48GB of RAM and 8 SSD drives (160GB
each) with a 3ware controller as well for just shy of 10K USD, I guess my
understanding of "monster" is quite different. For 13K USD the same server
would have 96GB of RAM.
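
Spelling the gap out as arithmetic, using only the figures quoted in this
thread (the variable names are mine, and "just shy of 10K" is rounded up to
10,000):

```python
# Figures quoted in this thread; nothing here is a real price list.
ARTICLE_TOTAL = 500_000   # article: "10 monster DB servers and 1 DBA"
SERVER_COUNT = 10
PRICE_48GB = 10_000       # "just shy of 10K USD", rounded up
PRICE_96GB = 13_000       # same server with 96GB of RAM

hardware_low = SERVER_COUNT * PRICE_48GB
hardware_high = SERVER_COUNT * PRICE_96GB
remainder = ARTICLE_TOTAL - hardware_high

print(hardware_low, hardware_high)  # 100000 130000
print(remainder)                    # 370000 left for the DBA and everything else
```

So even with the bigger configuration, ten of these boxes account for roughly
a quarter of the article's figure.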

~~~
wvenable
The numbers in the article are strangely inflated. In addition, it's as if the
need for a DBA disappears when you've simply changed your data storage software.
Somebody still has to know how it works and manage it.

If you don't need a 50 node cluster because your RDBMS is pulling down big
numbers, then you don't multiply the cost of the RDBMS solution by 50 either.

The numbers posted here are pretty reasonable. 37Signals spending $7,500 on
disks isn't outrageous. That's less than the cost of a single developer
integrating a different solution over a few months. How long has Digg been
working on this transition and how many employees did it require? They've
probably spent a fortune. Just not on hardware.

~~~
joe_the_user
It is arguable that the need for a DBA does disappear with the change in the
systems. The base approach of the relational model is that there is a
professional DBA who makes sure that the mission-critical data is available in
a format that multiple applications and multiple departments can use directly
(the type of data is known but the ordering isn't set). Thus DBA is a "hat" that
most programmers can't "wear", especially since a large portion of programmers
don't understand the relational model.

On the other hand, NoSQL and object databases allow a programmer to just stuff
data into a data store without worrying about a cohesive data model. _If_
we consider this as mission-critical data that multiple departments of a large
organization would want to see in multiple forms, then we can find many ways
that the approach of _"just save this array of values"_ produces serious
problems.

But there are many applications where these problems don't appear. Digg seems
like it could get away with doing nosql. A health-record site seems like it
could not do nosql since it ultimately is going to want ACID-and-beyond in its
data model.

~~~
shpxnvz
Good comment, but I wouldn't quite say that using a non-relational database
frees you from having to think about a cohesive data model. It might be more
accurate to say that for some data models non-relational stores are a more
natural fit, which frees you from having to think about how to force your model into
the wrong container.

~~~
joe_the_user
It's hard to find the right adjective for what's distinct about the relational
model.

The relational model is a fantastic model of data independent of application.
It can even be a great model for an application using the same data in
different ways.

But this approach clearly has a cost. In a way, the question is: is this
an application with a company built around it or a company with an application
built around it? Digg and Google are applications with companies built around
them. Here the RDBMS model doesn't make sense.

------
freshfunk
NoSQL vs RDBMS is really a proxy war for Denormalized vs Normalized storage.

You can take a system like Cassandra and treat data very much in a normalized
way which would reduce performance. You can take a system like MySQL and
completely denormalize your data which would increase performance.

Any test where one set of data is normalized and one isn't is not a fair test.

Also, denormalization can be a big deal. Unless you have some sophisticated
code managing it for you, you're trading performance for data storage
management complexity. Now you have to manage many instances of data X. But
there is a benefit in that you avoid crazy joins.
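
A toy sketch of that trade, assuming in-memory dicts stand in for tables or
column families and with all names (`users`, `posts`, `rename_user`) invented
for the example. The normalized read needs a join-like second lookup; the
denormalized read is one lookup, but a rename has to chase every copy.

```python
# Normalized: each fact lives in one place and reads need a join-like lookup.
users = {1: {"name": "alice"}}
posts = {
    10: {"user_id": 1, "title": "first"},
    11: {"user_id": 1, "title": "second"},
}


def read_normalized(post_id):
    post = posts[post_id]
    return {"title": post["title"], "author": users[post["user_id"]]["name"]}


# Denormalized: the author name is copied into every post, so a read is one lookup.
posts_denorm = {
    10: {"user_id": 1, "title": "first", "author": "alice"},
    11: {"user_id": 1, "title": "second", "author": "alice"},
}


def rename_user(user_id, new_name):
    """One write in the normalized model, plus a fan-out over every copy."""
    users[user_id]["name"] = new_name
    touched = 0
    for post in posts_denorm.values():
        if post["user_id"] == user_id:
            post["author"] = new_name
            touched += 1
    return touched


print(rename_user(1, "alicia"))       # 2 copies had to be rewritten
print(read_normalized(10)["author"])  # alicia
print(posts_denorm[11]["author"])     # alicia
```

If the fan-out in `rename_user` is skipped or fails halfway, the copies
disagree, and that is the management complexity being paid for the faster read.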

I think both have concepts to learn from each other. For example, in order to
use a NoSQL option effectively, you end up implementing your own concept of
indexes, something very easily done for you in RDBMSes.
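
For instance, a hand-rolled secondary index over a key/value store might look
something like the sketch below. The dicts stand in for the store, and `put`,
`find_by_city`, and the `city` field are hypothetical names; the point is only
that every write path has to remember to maintain the index, which an RDBMS
does for you after a single CREATE INDEX.

```python
from collections import defaultdict

records = {}                # primary key -> record
by_city = defaultdict(set)  # hand-rolled secondary index: city -> keys


def put(key, record):
    """Write the record and keep the index in step with it."""
    old = records.get(key)
    if old is not None:
        by_city[old["city"]].discard(key)  # drop the stale index entry
    records[key] = record
    by_city[record["city"]].add(key)


def find_by_city(city):
    return sorted(by_city.get(city, ()))


put("u1", {"city": "Oslo"})
put("u2", {"city": "Lima"})
put("u1", {"city": "Lima"})  # an update must also move the index entry
print(find_by_city("Lima"))  # ['u1', 'u2']
print(find_by_city("Oslo"))  # []
```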

------
Raphael_Amiard
It's funny how basically most people agree about this subject, that both
systems have their strong and weak points (for example I don't think I've seen
many articles saying that facebook/amazon _should_ have kept their whole
system running on rdbms), but still the endless queue of blog posts goes on.
Isn't this a case of violent agreement, as per
<http://c2.com/cgi/wiki?ViolentAgreement> ?

~~~
Semiapies
I don't know about that; the fact that the highest-rated comment on this page
is a rant written by someone who apparently didn't RTFA makes me think this is
just another holy war in early stages.

------
sunjain
I guess this is history repeating itself: before RDBMS there were
hierarchical DBs and other such technologies. There was a reason why RDBMS
won the battle. Probably the topmost reason was the simplicity with which you
can define relationships between data, store that data (with its
relationships), and access it easily. We are all used to SQL, and even with
all the ORMs in the world, it is still probably a very simple (yet powerful)
language.

If you look at HTTP, the cornerstone of the web, one of the reasons it has
been so popular is that it is simple yet powerful. It can be thought of as
representing the same simplicity as RDBMS (everything revolves around some
really basic operations: read, write, update, delete <-> GET, POST, PUT,
DELETE). Historically, the technologies that hide the complexity of what they
do behind a simple interface almost invariably win. That was one of the main
reasons for the success of RDBMS, and it still remains true.

It was also precisely the reason why everyone started using RDBMS for
blog-type applications, even though that is not the best use of RDBMS. Come to
think of it, why would you want to store a lot of text (the content of each
blog) for each blog id in an RDBMS? But people did, because it was so simple
to do and it is what they were used to. Hence the use of RDBMS for these types
of applications was debatable to begin with. However, if you look at the
transaction management required by some financial applications, for example, I
doubt how far the NoSQL solutions will go in satisfying the requirement.

With regard to scalability, I will not get into the cost factor, because there
are different ways of calculating the actual "cost" of something. If you are
Google, the cost of having a few PhDs writing your own filesystem and DB (or
NoSQL) is not much compared to the benefits you are going to get out of it.
But if you are not Google, it is a different story. So if we leave the cost
factor aside for a moment, the list of performance options available in some
of the high-end RDBMS technologies, for example Oracle, is quite broad:
active/active clustering (RAC), different index types (b-tree, bitmap,
clustered, index-organized), a multitude of partitioning types (range-based,
hash-based, combinations of these, list-based), and the list goes on. The same
is true of Sybase or SQL Server (except for active/active clustering). So I am
sure the performance issues can be handled in these RDBMS technologies
(without just throwing hardware at it).

------
mark_l_watson
I've never read anything by Joe Stump before but I just bookmarked this
article so I can peruse more of his stuff later.

I liked the way he personalized his argument to his own deployment situation
rather than making generalizations. I also liked to hear about his experience
with Cassandra (5 minutes to clone a hot node and have it balanced and in
production).

------
ck2
Understanding the limits of NoSQL from a RDBMS perspective:

<http://www.youtube.com/watch_popup?v=LhnGarRsKnA>

(slides: [http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=no-
sql...](http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=no-
sqltalk-091114171610-phpapp01) )

------
timtrueman
NoSQL/NoREL is a tradeoff of features for performance. If you don't need
certain features and you can make the trade and need the extra performance,
then it can make sense for some people. I don't think it applies to everyone.
They made the decision that was best for them and congrats on that.

Also I can say rotational disks may not provide the economics that make RDBMS
seem attractive—but FusionIO cards have really changed that. And I didn't just
read the datasheet and get a nerd boner. I watched the queries from 8 beefy
physical database boxes (that were getting hammered) combined onto one
physical box that was identical in all ways except it had a FusionIO card. It
handled 8x the number of queries with ease and could have taken a lot more
punishment. Yes, the cards are expensive but in the scheme of getting rid of 7
servers it was actually saving significant amounts of money.

------
chime
I just wish someone would offer a decent hosted NoSQL platform that I can
start using already. <https://cloudant.com/> is invite only. So is
<http://hosting.couch.io/> apparently. <https://app.mongohq.com/signup> is
overpriced, considering it's a cloud service and only 2GB. SimpleDB isn't bad
but it has tons of limitations (can't even sort by numbers).

This is why people still use MySQL even for projects that aren't suitable for
RDBMS. I use hosted MySQL at dreamhost and don't have to bother with anything
except my app and data. It just works and is free with the web hosting
package. Is there anything out there that comes close? I don't mind $1/month
for 1GB of data. $25 for 2GB is not worth it.

~~~
dryicerx
There aren't that many because there's not a big market for it. You ideally
want the DB in close proximity (latency wise) to your web server or whatever
uses it directly.

Why not host it yourself? Deploying a server like MongoDB is trivial to get
going.

~~~
_pius
For people deploying to cloud services, in-cloud latency is all the same no
matter who the app/database provider is.

For instance, Heroku and MongoHQ are both using Amazon Web Services, so it
wouldn't increase the latency to swap a MongoHQ database instance for a Heroku
one.

~~~
chime
Precisely. Even if my latency is 0.1s-0.2s, that is still not a big deal if I
write my app well.

------
wvenable
This response to Forbes's article doesn't address the central premise:

"Shocked by the incredibly poor database performance described on the Digg
technology blog, baffled that they cast it as demonstrative of performance
issues with RDBMS’ in general, I was motivated to create a simile of their
database problem."

The central question here isn't so much the maximum performance you can get
out of RDBMS system, or how it compares to a NoSQL solution, but how Digg is
getting such terrible performance out of their RDBMS design! The numbers
just don't add up.

This article is just a bunch of straw men that avoid the main issue. And
arguing that $7,500 is too much for a serious web SaaS vendor to spend is just
comical.

~~~
wanderr
Didn't he?

"Has anyone ran benchmarks with MySQL or PostgreSQL in an environment that
sees 35,000 requests a second? IO contention becomes a huge issue when your
stack needs to serve that many requests simultaneously."

 _my_ answer to this point is that IO contention can be vastly reduced in
MySQL (and probably even better handled in Postgres, I bet) with some tweaking
of settings and lots of memory. Memory is pretty cheap these days, so stuffing
a server full of RAM is really not a bad option.

------
johnrob
It's just a spectrum. At one end, where all we do is write, we use log files.
At the other end, where all we do is read, we cache static content at network
hubs. There's a lot in between. An RDBMS is one tradeoff, 'NoSQL' represents
another. They simply imply different types of read/write tradeoffs.

------
StrawberryFrog
_At Digg we had probably a hundred or so tables, each table had varying
indexes (a char here, an integer there, a date+time here)_

This may be part of the problem, actually. 100 tables to serve posts with
attached comments? Um.

~~~
blinks
There's a bit more going on there. See
[http://www.codinghorror.com/blog/2009/07/code-its-
trivial.ht...](http://www.codinghorror.com/blog/2009/07/code-its-trivial.html)
for a related principle.

~~~
houseabsolute
True, but I can't imagine more than a couple of those tables are accessed at
the high frequency rate that requires performance optimization.

------
hvs
Just because some bloggers are obsessed with arguing whether NoSQL or RDBMSs
are "better" doesn't mean we need to post every article to HN. Why don't we
agree to just stop?

~~~
gnaritas
Perhaps because plenty of people are interested in the subject. If you don't
like it, don't read it.

------
Zak
_Furthermore, this is an operational expense as opposed to a capital expense,
which is a bit nicer on the books._

Something about this seems broken. Why would it be inherently "nicer" to spend
money on a service as you use it than on a product that you get to keep?

~~~
wrs
Because we're talking about a business. Capital expenses must be written off
as depreciation over time. Operational expenses can be written off
immediately. For tax reasons this can be a big deal.
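
A simplified sketch of the difference, assuming straight-line depreciation and
made-up numbers (a 30,000 spend and a 3-year schedule). Real tax rules vary by
jurisdiction and asset class, so treat this as an illustration of the timing
only.

```python
def straight_line_deductions(cost, years):
    """Capital expense: deduct an equal slice of the cost each year."""
    return [cost / years] * years


def immediate_deduction(cost, years):
    """Operational expense: deduct the whole amount in the first year."""
    return [cost] + [0.0] * (years - 1)


COST = 30_000.0  # hypothetical spend
YEARS = 3        # hypothetical depreciation schedule

print(straight_line_deductions(COST, YEARS))  # [10000.0, 10000.0, 10000.0]
print(immediate_deduction(COST, YEARS))       # [30000.0, 0.0, 0.0]
```

The total deducted is the same either way; what changes is how much of it
lands in year one.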

However, you can buy servers through a leasing company to get this benefit;
you don't have to use EC2.

------
japherwocky
<3

NoSQL is faster to develop/prototype with also, since you only need to
understand json dictionaries.

So your shit ships faster, and scales cheaper.

------
onetimeiter
use whatever makes your product useful!

