

Top 10 Reasons to Avoid the SimpleDB Hype - RyanGWU82
http://www.ryanpark.org/2008/04/top-10-avoid-the-simpledb-hype.html

======
jsjenkins168
All valid points _if_ compared against a single-instance RDBMS. Except that
modern RDBMS used in any large scale web service will also be guilty of these
points. When you shard/cluster your data across various master/slave DB
instances you essentially lose the ability to store your data in a non-atomic
normal form, just like SimpleDB. In both a sharded RDBMS and SimpleDB your
data is denormalized.

With a DB setup like this (and all major web sites are) you cannot feasibly
join, group by (or do associated aggregations), or order by anymore anyway.
Constraints are not really a good idea either. It would be nice if SQL was
still useful in a situation like this but it just is not.

The days of hugely massive single machine databases is coming to and end if
not already dead. Even if you are reporting on massive amounts of data and
need to do aggregations across large datasets, then OLAP is increasingly
becoming the way to go (again, with fact tables and cube partitions sharded
across multiple instances to increase query performance).

~~~
mechanical_fish
Your argument is good, but it also serves to advance the argument that using
SimpleDB for a new web app is premature optimization.

Yeah, when your app has millions of users and massive amounts of data you
might need to adopt sharding or clustering, perhaps giving up the advantages
of a huge single-machine database relative to SimpleDB. And, _once you have
this problem_ , it might be a good time to contemplate rearchitecting your
entire system around something like SimpleDB. But few web apps _start off_
with millions of users and massive amounts of data, and to deliberately avoid
using SQL tools just because they won't scale to Facebook-esque levels of
traffic is the very definition of "premature optimization".

------
tx
Finally, someone took time and said it. Moreover, I'd recommend for someone
who didn't have exposure to "big iron databases" to look up performance
benchmarks for Oracle, Microsoft and DB2, they're very popular not only among
these companies but also among big-iron manufacturers like Sun, IBM and HP.
It's crazy what they can do if you use them right, good luck building a
business that will need such transactional capabilities of your DB.

~~~
jrockway
I assume the real appeal behind SimpleDB is that you don't have to buy
servers, disk arrays, backup systems, a data center, 24-hour techs to swap
RAID drives, etc. You just put your data in there and forget about that stuff.

~~~
jsjenkins168
Exactly. And its not just hardware you can forget about. No longer having to
worry about indexing, scheduled backups and database shrinks, temp table
sizes, connection pooling, and all the other nasty stuff that goes along with
maintaining a DBMS is equally as nice.

I've switched my project from a local MySQL instance to SimpleDB and I'm never
looking back. It took some work to simplify my existing schema to be purely
atomic (no joins), but after that you can forget just about everything else.

~~~
jrockway
I would be interested in hearing more about your migration from MySQL to
SimpleDB.

Basically, I am a big fan of normalization (since none of the other ways seem
elegant to me). For example, if I were going to write the schema for a Blog, I
would have an articles table, a tags table, and an article<->tag mapping
table. This means that "show me articles with tag foo" is indexed (and
"faster" than an O(n) search over the articles table).

The only approach I can see with SimpleDB is to tolerate scanning all articles
to search tags (and authors, etc.) or to maintain a separate "document" for
each tag (that links to the articles with the tag). Then I worry about
concurrent updates to a tag ( {foo:[1,2]} gets updated to {foo:[1,2,3]} while
someone else immediately changes it to {foo:[1,2,4]}; the correct result would
be {foo:[1,2,3,4]} but adding 4 overwrites adding 3.)

Anyway, like I said, I would be interested in hearing people's experiences.
Perhaps I am missing something obvious.

~~~
jsjenkins168
With SimpleDB, you can still accomplish many-to-one and even one-to-many. So
in your example schema you described, you could have an attribute (aka column)
in your tags domain (aka table) which holds the ids of the article items (aka
rows) in your article domain that the tag is associated with. Since your data
is denormalized though, you will have many duplicate items if a tag applies to
many different articles. Thats ok though, SimpleDB was designed for this dont
worry about large domains.

So this is how you could get your many-to-one mapping. To determine what
articles have tag foo, you would just do the equivalent of select from tags
where your attribute value matches the item id for the desired article.

It seems unintuitive, but you can actually do one-to-many mappings too with
SimpleDB. So you could say "show me all the tags for article A" without a join
table or duplicate items. The reason this is possible is because SimpleDB can
attach up to 256 values to a single attribute. This obviously isnt possible in
SQL. You just assign references to each tag id for that article to one
attribute. Kinda like putting a list of Strings in a single cell in SQL. Be
very careful with this though, as you cannot exceed 256 values bound to 1
attribute. So no more than 256 tags per article in your example.

But the problem with all this is keeping track of changes. updates and deletes
do not cascade so you have to handle that yourself. And of course you have to
consider eventual consistency if making frequent changes on top of each other.

My suggestion to help reduce the complexity of managing your domain references
is to use this library (if you are using Java that is). It makes SimpleDB so
much more usable: <http://code.google.com/p/simplejpa/>

~~~
jrockway
Thanks.

I hacked together a SimpleDB-alike tonight and played with various index-on-
write schemes. I feel a bit better about the situation now. Of course, now I
just need to make it as scalable as SimpleDB :)

( Example usage: [http://git.jrock.us/?p=MooseX-Storage-
Directory.git;a=blob;f...](http://git.jrock.us/?p=MooseX-Storage-
Directory.git;a=blob;f=t/example-blog.t;hb=HEAD) )

You can tell from the test data that I am in a weird mood, though, so time for
bed :)

------
cstejerean
if you think in SQL every project seems to require the "advantages" of an
RDBMS. But the critics are forgetting that:

a) when you use sharding your RDBMS looses most of the advantages you seem to
like so much

b) SimpleDB and friends not only distribute your data but they allow you to
parallelize your queries. think map/reduce

c) as a programmer I would much rather solve summing and grouping in my code
then learn the subtleties of properly configuring and managing a particular
RDBMS (and re-learn everything when I switch vendors)

I'll agree that SimpleDB is not appropriate for everything, but most of the
perceived problems are really just a lack of creativity.

~~~
donw
I'm not quite sure I understand what advantages are lost by sharding -- would
you mind elaborating?

As far as being able to write parallel queries, this is perhaps true, but
_you_ have to manage the parallelization, synchronization, and all other
manner of nastiness. For data that doesn't require frequent updates (like
Google), this may be a better route, but for systems where consistent data is
important, I would much rather rely on years' worth of RDBMS research. It's
not a matter of a lack of creativity; rather, I would prefer to spend my time
solving problems as they are related to my application, not duplicating the
bits of an RDBMS that aren't present in SimpleDB.

I find point (c) to be a bit surprising -- SQL is based around the concepts of
set theory, and works in a fairly straightforward fashion. While there are
differences between implementations, you certainly don't have to 're-learn
everything' when you switch vendors, and the basic tasks of database
management are not terribly complicated.

And yes, I regularly bounce between about three different SQL implementations
(Postgres, MySQL, and SQLlite). Spent some time with Sybase as well, and I am
not a DBA. There are differences, definitely, but it never took me more than a
day or two to get up to speed, and I am not a genius by any stretch of the
imagination.

------
wanorris
If you needed to, you could solve most of the problems with SimpleDB by
writing a server that wraps it and provides consistency checks, sorting and
such. I was considering writing that for a little while.

Then they announced that they were going to support mountable drives on EC2
images, and it stopped being necessary. When they release that, you'll be able
to load up a more straightforward MySQL or PostgreSQL setup, and use
traditional clustering mechanisms.

Honestly, I find SimpleDB interesting, but I have no idea what I could use it
for without writing quite a bit of middleware to wrap it.

------
rgrieselhuber
Good article. Like anything, there are pros and cons to each approach. I
tested SimpleDB pretty thoroughly for the next version of Gridjit and decided
against using it.

One of the drawbacks to working with these sorts of storage services is the
temptation to create overly-abstract models too early. Abstraction and
metadata-based schemas are great but can also add needless complexity
(ironically) early on.

------
martythemaniak
Excellent article. I've been considering using either SimpleDB or Google's App
Engine, so hearing both sides of the debate is incredibly useful.

------
okeumeni
I love this article, kudos Ryan!

I was about to do similar research before committing our next project of 5TB
data space with SimpleDB, ouch! It would have been a disaster. I will probably
use the S3 just for mass storage nothing else.

We want more of these unbiased articles on HN.

------
systems
i am trying to sum things up

a lot of his problems seem to suggest that SimpleDB require more coding to
contol the data.

the counter argument is, that okay, this is true, but you also don't do any
maintenance ... which seems like a fair trade-off, plus you can always add an
abstraction layer in code that simplify the cooding

so my question becomes, can't one add an abstraction layer on top of simpledb
that makes it act more like a traditional rdbms ?

------
vlad
Thank you.

~~~
vlad
Not sarcastic. :) I was considering the merits of using SimpleDB the other
day.

