

SQL Databases Are An Overapplied Solution (And What To Use Instead) - conorgil145
http://adamblog.heroku.com/past/2009/7/8/sql_databases_are_an_overapplied_solution_and_what_to_use_instead/

======
wvenable
There's definitely a lack of imagination when it comes to proponents of NoSQL
solutions, and this article is no exception. While we do want to store, for
example, an entire e-commerce order in a single operation we also don't
necessarily want to retrieve it that way. That type of storage makes otherwise
simple operations considerably more difficult. Do you want to know how much
you made in sales today? How many of widget #453 are still in stock? The most
popular items for sale? The least popular? Start writing code.

The other example of a user profile seems to be a perfect fit for storage in
a single table, so I don't know why that's there. Now, do you want to know how
many users logged in this week? How about deleting every account that hasn't
been accessed in a year? Much slower code.

Every article about NoSQL goes on and on about the supposed advantages, but
rarely talks about the considerable disadvantages. And honestly, for most
people, the advantages are simply not worth the trade off. I fear that we'll
have decades of e-commerce stores written with document stores and mountains
of code slowly chugging away to calculate the basic stats that any business
needs.

~~~
viraptor
Have you used any NoSQL solution seriously? Let's see some examples (an
imaginary Python-like interface, so I don't have to be language/backend-
specific). Like you say: start writing code.

Do you want to know how much you made in sales today?

    
    
        sum(r['amount'] for r in db.filter(type='order', date=xxx))
        -vs-
        SELECT SUM(amount) FROM orders WHERE date = ?
    

How many of widget #453 are still in stock? (IRL it's never that simple,
but...)

    
    
        db.filter(type='stock_widget', part_id=453)[0]['amount']
        -vs-
        SELECT amount FROM storage WHERE id = 'widget 453'
    

Most popular:

    
    
        from collections import Counter
        histogram = Counter(r['part_id'] for r in db.filter(type='item', date=xxx))
        histogram.most_common(1)
    

What I wanted to show is that you're writing about the same amount of code in
both cases. In some databases (like Tyrant) you can also run the script
server-side and just report the result if you prefer. Also, depending on the
database, you don't need to read the whole record every time - most of them
let you request just a list of fields.

Have some fun with a NoSQL database before rejecting it for reasons like the
ones you mentioned... It's also not always about processing speed - I could
use either solution, but coding for TT is just simpler than for any SQL in
most of what I do. Notice how my db.filter examples give you the solution in
the current language, while SQL queries are just that - queries you have to
run and then retrieve results from (I'm ignoring LINQ-to-SQL here).

~~~
wvenable
The question is what is this doing:

    
    
        sum(r['amount'] for r in db.filter(type='order', date=xxx))
    

If you're iterating over every order in the database to apply your filter or
to calculate the most popular items, then you're wasting a huge amount of
processing power and RAM on something an RDBMS can do very efficiently. It's
not an advantage that you're incrementing values in your programming language
of choice, either; it's a disaster.

As for the API, you can hide SQL pretty well behind an abstraction too. But
there is an advantage to using SQL itself -- the database engine analyzes
your query and produces an efficient plan for getting the answer you want.
You're choosing to write poorly optimized code instead.
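
To make the point concrete, here's a minimal sketch using SQLite from Python
(the table and index names are invented for the example) -- the engine
answers the aggregate from an index, and you can ask the planner what it
chose:

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, date TEXT)")
        conn.execute("CREATE INDEX idx_orders_date ON orders (date)")
        conn.executemany("INSERT INTO orders (amount, date) VALUES (?, ?)",
                         [(10.0, '2009-11-04'), (25.5, '2009-11-04'), (7.25, '2009-11-03')])

        # The engine picks the access path; the application never iterates rows.
        total, = conn.execute("SELECT SUM(amount) FROM orders WHERE date = ?",
                              ('2009-11-04',)).fetchone()
        print(total)  # 35.5

        # Ask the planner what it decided -- here, a search of the date index.
        for row in conn.execute("EXPLAIN QUERY PLAN "
                                "SELECT SUM(amount) FROM orders WHERE date = ?",
                                ('2009-11-04',)):
            print(row)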

~~~
smcq
You don't seem to understand how stored views work. The computation cost is
felt at insertion time, and it is not theoretically worse than the exact same
computation cost an RDBMS pays at insertion time to build almost identical
index structures.

What current _implementations_ of RDBMSs gain you is the ability to write
completely ad-hoc queries and get reasonable performance most of the time.
This is an implementation advantage, not a theoretical one.
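
To illustrate: a toy sketch of a stored view in Python. Real document stores
such as CouchDB build these views from map/reduce functions, but the cost
model is the same -- pay at insert, read for free:

        from collections import defaultdict

        # Toy "stored view": the aggregate is maintained as documents are
        # inserted, so a query is an O(1) lookup -- analogous to an RDBMS
        # updating an index on every INSERT.
        documents = []
        sales_by_date = defaultdict(float)   # the materialized view

        def insert_order(order):
            documents.append(order)                           # store the document
            sales_by_date[order['date']] += order['amount']   # pay the cost now

        insert_order({'type': 'order', 'date': '2009-11-04', 'amount': 10.0})
        insert_order({'type': 'order', 'date': '2009-11-04', 'amount': 25.5})
        print(sales_by_date['2009-11-04'])   # 35.5, no scan at query time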

~~~
wvenable
I understand that you can have indexes -- which, in my opinion, seems to
defeat the purpose of NoSQL in the first place: you take your unstructured
data store and structure it.

Also, in the author's example the items are properties of the order. How
exactly would you index on those?
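
(For context, the answer document stores usually give is a view whose map
function emits one entry per embedded item; a rough Python sketch of the
idea, since e.g. CouchDB's real map functions are JavaScript:)

        # A map function emits one index entry per embedded line item, so
        # the view is queryable by part_id even though items live inside
        # orders. (Sketch only, not any specific product's API.)
        def map_order(doc):
            if doc.get('type') == 'order':
                for item in doc.get('items', []):
                    yield item['part_id'], item['qty']   # key -> value

        orders = [{'type': 'order',
                   'items': [{'part_id': 453, 'qty': 2},
                             {'part_id': 99, 'qty': 1}]}]

        index = {}
        for doc in orders:
            for key, value in map_order(doc):
                index.setdefault(key, []).append(value)

        print(index[453])   # every quantity of widget #453 across all orders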

------
wvenable
From the article, on where relational databases are appropriate:

 _> Small records with complex, well-defined, highly normalized
relationships._

Why do the records need to be small? And honestly, in software development, a
large amount of your data is going to be well-defined and easily normalized.
The author provided two examples that would fit perfectly in a relational
database.

 _> The type of queries you will be running on the data is largely unknown up
front._

Or the types of queries you are running are more than just retrieving a
single record or a simple list of records. I'm afraid a very large number of
queries fall into this category.

 _> The data is long-lived: it will be written once, updated infrequently
relative to the number of reads/queries, and deleted either never, or many
years in the future._

Yes, relational databases are for storing long-lived data. For temporary
data, you could use an in-memory table or some other solution entirely.
There's no need to mix your permanent data with your temporary data.
Databases handle writes and deletes extremely well (even in bulk), so I'm not
sure what the author was getting at here.

 _> The database does not need round-the-clock availability (middle-of-the-
night maintenance windows are no problem)._

What kind of middle-of-the-night maintenance does a relational database need?
I've been running at least one database for several years straight without any
downtime or maintenance.

 _> Queries do not need to be particularly fast, they just need to return
correct results eventually._

Relational database queries aren't particularly slow -- in fact, RDBMSs are
heavily optimized to return data very quickly. In the vast majority of cases,
this is going to be more than fast enough for nearly every application.

 _> Data integrity is 100% paramount, trumping all other concerns, such as
performance and scalability._

Damn straight. I want the data coming from my data store to be 100% correct,
always. If I need to trade performance for correctness, then I can easily add
some caching. But I'm not sure how document stores would solve this any
differently.
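
(The caching I mean is a few lines; a minimal sketch, with the `query_db`
helper, the table name, and the eviction policy all invented for
illustration:)

        import functools

        def query_db(sql, *params):
            ...   # stand-in for the real database call

        # Read-through cache: repeated reads are served from memory while
        # the database stays the source of truth. (A real deployment would
        # use memcached/Redis plus explicit invalidation.)
        @functools.lru_cache(maxsize=1024)
        def total_sales(date):
            return query_db("SELECT SUM(amount) FROM orders WHERE date = ?", date)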

~~~
Retric
I can only assume the reason this was at negative 1 is that most people
interested in this topic are already on the (not just SQL) bandwagon. IMO,
scaling is a non-issue for most well-designed websites, and as computers get
faster this only becomes more apparent. There is a significant advantage to
separating complex sites into independent modular components, and only a tiny
fraction of sites need to scale beyond this point. When you actually need to
expand, fine, go down that rabbit hole; but for most people it's a complete
waste of time.

PS: I suspect the main problem developers actually have with SQL databases is
that their ORM is significantly less powerful than SQL. All too often,
developers focus on row-as-object and forget the power of more abstract data
structures.

------
jmm
I think more specific context on "SQL databases don't scale" might be
appropriate here, especially for the noob hackers who are figuring out just
which technology to use in building their first or second apps. Is it worth
ignoring defaults and conventions on a simple Rails app and not using an SQL-
based db? Probably not. And I reckon that Heroku itself hosts a good deal of
apps where "will this db scale?" isn't really relevant. I could be wrong.

Another question worth knowing the answer to is how much of a hassle it is to
switch horses midstream (from a well-normalized SQL db to something else)
after there is some data in the tank.

Maybe this is Adam's recommendation specifically to the dreaming-big
community, which I can certainly appreciate. And maybe everyone should be
dreaming big.

------
gridspy
Gridspy monitors power usage in real time. My original plan was to effectively
stream the live data through the database as it passed from collection
(sensors) to display (on the website). The frequent writing to and polling of
the database to keep data "live" was absolutely killing performance.

Now I have moved to a Twisted application that aggregates the data and does
occasional writes to the DB. It can answer webserver queries for the latest
data out of its internal data structures, and it streams live data to the
user's browser via Orbited.

See <http://blog.gridspy.co.nz/2009/09/database-meet..> (the database side)
and <http://blog.gridspy.co.nz/2009/10/realtime-data..> (the whole
application structure).

[I posted this on the original site too]
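
In Python-ish pseudocode, the write path looks roughly like this (names and
schema are invented; the real code is a Twisted app, and the streaming to
browsers goes via Orbited):

        import threading, time

        # Buffer readings in memory, answer "latest value" queries from the
        # buffer, and flush to the DB only occasionally.
        class Aggregator:
            def __init__(self):
                self.latest = {}     # sensor_id -> most recent reading
                self.pending = []    # readings awaiting a bulk write
                self.lock = threading.Lock()

            def record(self, sensor_id, watts):
                with self.lock:
                    self.latest[sensor_id] = watts
                    self.pending.append((sensor_id, watts, time.time()))

            def latest_reading(self, sensor_id):
                return self.latest.get(sensor_id)   # served without touching the DB

            def flush(self, db):
                with self.lock:
                    rows, self.pending = self.pending, []
                db.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
                db.commit()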

------
simonw
I liked the definition of "transient data" - I've been promoting Redis for
that kind of thing, but I didn't have the vocabulary to explain what I meant
(I've been saying "stuff like stats and counters"). Defining transient data
as frequently written, infrequently read is useful too.
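
A minimal sketch of what I mean, assuming the redis-py client and a local
Redis server (the key names are invented):

        import redis   # assumes the redis-py client and a running server

        r = redis.Redis()

        # "Transient" in exactly this sense: written constantly, read
        # rarely (say, on a stats dashboard), and survivable if lost.
        def record_page_view(page_id):
            r.incr('views:%s' % page_id)    # frequent, cheap writes

        def view_count(page_id):
            return int(r.get('views:%s' % page_id) or 0)   # infrequent reads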

~~~
jazzychad
I have been describing this type of data as "high velocity" to my friends, and
they seem to get it.

------
pvg
Hang on, SQL databases are poor at highly structured things like an order and
that's the sort of thing suitable for a document database?

------
CaptainZapp
It doesn't feel like the author has real-world experience with relational
databases. Let's see:

"Small records with complex, well-defined, highly normalized relationships."

Maybe, but not necessarily. Example? Call Data Records for a phone billing
application. While there may be relationships in the inserted rows, even to
other databases (i.e., customer information), the data stands pretty much on
its own. The challenge is to get floods of data into a single table, so that
it can later be sliced and diced for billing purposes.

"The type of queries you will be running on the data is largely unknown up
front."

Excuse me? This is so very wrong, I don't even know where to start. _EVERY_
good RDB application is designed in a way where (ideally) all queries are
known up-front. If you have to guarantee response times (think of a cash
withdrawal at an ATM), you absolutely _MUST_ control the queries that run on
the db. Ad-hoc analysis and reporting _MUST_ be relegated to dedicated,
possibly replicated databases.

"The data is long-lived: it will be written once, updated infrequently
relative to the number of reads/queries, and deleted either never, or many
years in the future."

Mostly so, but absolutely not necessarily the case. I work on an application
where the entire data set is toasted after a couple of weeks; in fact, today
and yesterday would suffice.

"The database does not need round-the-clock availability (middle-of-the-night
maintenance windows are no problem). Queries do not need to be particularly
fast, they just need to return correct results eventually."

This is so full of crap, I won't even get into it.

"Data integrity is 100% paramount, trumping all other concerns, such as
performance and scalability"

Yes, 100% integrity is paramount, but most certainly not at the cost of
scalability, let alone performance.

Recently, methinks, there have been a lot of proponents of new and improved
data management capabilities who see their little walled environment but seem
to have no experience whatsoever running databases in a real business. A
normal business (even a big, huge, or multinational one) does not have those
"cloud-data-management" requirements that very, very few companies really
have.

------
blasdel
Consider the source: Heroku needs an alternative to Postgres to be a true
competitor to AppEngine.

There's something that never gets brought up in these NoSQL discussions: SQL
Databases don't scale _down_. They aren't very good in multitenant situations
where you have a lot of random small-fry users -- you end up just sharding the
users across a bunch of different master-slave pairs, and hope that they don't
step on each other's toes. Because they take up real resources even if not
being used, it's difficult to pull off a freemium model.

~~~
silentbicycle
SQLite scales down fine, FWIW. It's just not appropriate past a certain level
of concurrent writes, but it's great for prototyping and smaller services.
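
(A sketch of what that can look like -- one database file per small user,
consuming disk and CPU only when actually used; the paths and schema are
invented:)

        import os, sqlite3

        # One cheap way SQLite "scales down" in a multitenant setup:
        # a database file per small-fry tenant.
        def tenant_db(user_id, root='tenants'):
            os.makedirs(root, exist_ok=True)
            conn = sqlite3.connect(os.path.join(root, '%d.db' % user_id))
            conn.execute("CREATE TABLE IF NOT EXISTS notes "
                         "(id INTEGER PRIMARY KEY, body TEXT)")
            return conn

        db = tenant_db(42)
        db.execute("INSERT INTO notes (body) VALUES (?)", ('hello',))
        db.commit()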

~~~
blasdel
That is a clever idea I'd not thought of, but it isn't really multitenant --
it just pushes the problem down into the filesystem.

It also requires a discontinuous transition to a different SQL database once
you graduate from being a small-fry, and from there you're in the same boat as
everyone else trying to scale that to multiple machines without application
changes.

~~~
silentbicycle
I'm not sure what you mean by, "it just pushes the problem down into the
filesystem". Could you explain?

On the upside, SQLite will take disk space, memory, and CPU proportional to
_actual_ usage, and compared to (say) Ruby or Python, it's a drop in the
bucket.

As to the discontinuous transition between it and a bigger system, sure. It's
a trade-off. So is worrying about scalability plans during early prototyping,
though.

~~~
blasdel
If you're building a multitenant distributed system like GAE, the SQLite blob
needs to be stored in a distributed system too, even if you don't need to
handle concurrent access.

~~~
silentbicycle
Got it. SQLite also explicitly notes that its locking does not work properly
with many broken implementations of NFS (<http://sqlite.org/faq.html#q5>), so
that's not an option.

------
asolove
Although this topic is frequently discussed on HN, and the portion about
document stores is well-hashed, I found the tidbit on binary assets/file
uploads very interesting, as I had only used disk before.

~~~
gte910h
Yeah, I was happy to find out about Tokyo Tyrant as well, had never seen it
mentioned.

------
peterwwillis
There is no explanation of why a filesystem is a bad place to store "binary
blobs".

If you're collapsing the metrics that you're storing _IN SQL_, something
_really_ wrong is going on.

Logs are OK to store in SQL, assuming you're scraping your logs properly and
logging the proper things. Logging every clickthrough in a relational
database is somewhat insane; logging 10 minutes' worth of aggregate
clickthroughs is perfectly fine. If you think logs should be a ring buffer, I
challenge you to tell that to any admin of a system subject to laws governing
how long you must store logs (which is pretty much all of e-commerce?).
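
(Concretely, the 10-minute aggregation looks something like this sketch, with
an invented table and a `db` handle assumed:)

        import time
        from collections import Counter

        # Count clicks in memory, then write one summary row per page per
        # window instead of one row per click.
        window = Counter()
        window_start = time.time()

        def log_click(page_id):
            window[page_id] += 1

        def flush(db):
            global window, window_start
            counts, window = window, Counter()
            db.executemany(
                "INSERT INTO clicks_agg (window_start, page_id, clicks) VALUES (?, ?, ?)",
                [(int(window_start), page, n) for page, n in counts.items()])
            db.commit()
            window_start = time.time()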

~~~
dangrossman
Meh, I've been recording a couple million page views a day to a MySQL
database (a new row for every page view) for 6 years. Works fine. I don't
expect a million new users to show up any time soon.

~~~
peterwwillis
six billion five hundred seventy million rows?

~~~
dangrossman
Yes, I do actually have billions of rows. See <http://www.w3counter.com>

------
jcapote
I agree wholeheartedly with this article, except for the part about
S3/CloudFront being the only tool for storing binary assets. MongoDB's GridFS
extension does a fine job of handling files in the database, since they shard
and replicate just like database records. As for serving them out, you can
run it on a cheap unmetered server using nginx +
<http://github.com/mdirolf/nginx-gridfs>
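
For the curious, the GridFS side is only a few lines with pymongo (a sketch;
it assumes a local mongod, and the database and file names are made up):

        import gridfs
        from pymongo import MongoClient

        db = MongoClient().test_db
        fs = gridfs.GridFS(db)

        # Store and retrieve a binary asset; GridFS splits it into chunk
        # documents that shard/replicate like any other records.
        file_id = fs.put(b'...binary asset bytes...', filename='logo.png')
        data = fs.get(file_id).read()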

------
cookiecaper
I can feel the sincerity of the author, but ultimately I think he's wrong
that the solution to this is NoSQL.

It really would be nice to send an e-commerce order as JSON and have my
database know what to do with it. I think we still need the flexibility and
power of a relational database behind it, but if someone extended PostgreSQL
to accept records the way CouchDB and others do, and taught it how to store
them in rigid, joinable, relational tables, that would be just great and
would help a lot. All of the advanced relational functionality would still be
available when needed, but by default you could write and retrieve data in a
document format that the database had previously, transparently, mapped onto
tables. That would be awesome.
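
(What I'm describing, done by hand today -- a sketch with an invented
schema; the point is that the database itself could do this mapping
transparently:)

        import json, sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
            CREATE TABLE order_items (order_id INTEGER, part_id INTEGER, qty INTEGER);
        """)

        # Accept an order as JSON and map it onto normalized tables.
        def store_order(doc):
            order = json.loads(doc)
            cur = conn.execute("INSERT INTO orders (customer) VALUES (?)",
                               (order['customer'],))
            conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)",
                             [(cur.lastrowid, i['part_id'], i['qty'])
                              for i in order['items']])
            conn.commit()

        store_order('{"customer": "bob", "items": [{"part_id": 453, "qty": 2}]}')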

~~~
oomkiller
I think you just described inserting into a SQL view. Never done it myself,
but it sounds helpful.

------
sant0sk1
Does Heroku offer any alternate persistence engines? I thought they were
PostgreSQL only.

------
davidnelson
I love the document record pattern. I use it on Google App Engine, which is a
tad tricky to use as a relational db. It actually ends up being way simpler
and faster. For instance, you can model an app with just one "table" (Kind).
You can use, say, a user id as the primary key, then have binary blob fields
which store your serialized document records. The reason it's so fast is that
all you're doing is a simple index scan to look up the data you want. It also
makes it really simple to store items in memcache.

------
thewileyone
I disagree with the comment "SQL databases don't scale". If you've got a
decent database architect, scalability (not capacity) stops being an issue.

------
marshallp
These NoSQL people are missing the point of relational modeling: that you can
easily and incrementally evolve your data model. It's why object databases
never caught on.

SQL databases are absolutely beautiful and elegant when you think of them in
terms of Codd's relational model -- in my opinion the best thing that
computer science has produced so far.

The only limitation of relational databases currently is their lack of
automatic infinite horizontal scaling on commodity servers, but hopefully
someone will solve that soon.

~~~
peterwwillis
if by "solve" you mean "completely redesign modern RDBMS's" you may be waiting
for a while. i'm no expert, but my limited understanding is that most of these
services provide a singular interface to a database. to be "automatic infinite
horizontal scaling" they'd need to support an infinite number of interfaces on
any of these commodity servers. i could be wrong, because again i don't fully
understand them, but i think that means basically implementing the equivalent
of an 'intelligent' distributed parallel fault-tolerant file system. at that
point i figure you could add "lots of glue" (heh, understatement?) to
MemcacheDB, MemcacheD, MemcacheQ and some other crap and get something
similar. but i could be crazy.

~~~
marshallp
The Google App Engine datastore is moving in the direction of being a full
SQL database, and I bet Oracle has people working on it (because App Engine,
Salesforce and AWS are a threat to their business), so I don't think it's all
that far-fetched.

~~~
csytan
App Engine's GQL is a very limited subset of SQL. The only queries that can
be performed are those that scale well, so it's doubtful that even simple
things like JOINs will be supported in the future.

[http://code.google.com/appengine/docs/python/datastore/gqlre...](http://code.google.com/appengine/docs/python/datastore/gqlreference.html)

~~~
marshallp
The App Engine datastore works by creating indexes for every query. In
relational databases you make joins execute fast by creating indexes. I don't
see why joins couldn't be added to App Engine.

~~~
davidnelson
I believe there are a few reasons. One is that the Google File System blocks
that Bigtable runs on, and thus the "reads" for your query, could be on any
number of machines. It's been a while since I read the Google architecture
papers, but I'd recommend them to all; it's fascinating how they implemented
GFS, Bigtable, the Chubby lock service, etc. The workaround is essentially to
write your own code that does what a database does: something to the effect
of getting a list of keys for the records you're interested in and using that
to scan just the area of your data structure you want.

~~~
marshallp
It hasn't got anything to do with that. Creating an index on a join means
that possibly hundreds of indexes need to be updated every time you make a
write - but that is where App Engine would actually shine, since those writes
can be parallelized across hundreds of machines. It could easily make
Postgres/MySQL/Oracle redundant.

------
w3matter
Yeah, Heroku should put up or shut up until they start offering alternate
NoSQL databases (and no, MongoHQ is not an alternative because of network
latency).

Interesting article though.

~~~
cracell
As they are hosted on EC2, couldn't you use Amazon CloudFront with them
without too much trouble and with no latency issues?
<http://aws.amazon.com/cloudfront/>

The main issue would be integrating CloudFront with Rails, not getting it to
run on Heroku. (A quick search doesn't show much support for using CloudFront
with Rails in general.)

