

Amazon EC2 + S3 Doesn't Cut it for Real Applications - nickb
http://www.25hoursaday.com/weblog/2007/07/04/AmazonEC2S3DoesntCutItForRealApplications.aspx

======
ed
The title is misleading. The post explains that EC2+S3 does not provide a good
solution for relational databases -- but it's not supposed to (this is a
common misconception).

The author makes a good point but never discusses one of the bigger issues
surrounding EC2 database instances -- network latency. The architecture of EC2
is such that multiple EC2 instances are not guaranteed to be on the same local
network, meaning that you might end up with a database server in Boston and
your Apache server in Seattle. This gets really inefficient if your back-end
code relies on multiple DB queries.

However, EC2+S3 are great at what they're designed for -- providing
SUPPLEMENTAL computing and storage resources. This is ideal for web
applications which need to process file data in the background. Amazon appears
to be working on a service which would be the database analog of S3. Only
after that's released would we be able to judge Amazon as a complete app-
hosting solution.

~~~
joshwa
So then what _is_ the answer to scaling relational databases? Getting rid of
it and writing your own storage engine?

~~~
epi0Bauqu
What are you doing? For most Internet startups, PostgreSQL will be fine. All
you have to do is use indexes correctly and then keep their size under the
memory allocated to postgres. That can really get you a long long way.
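A minimal sketch of the pattern being described, with a hypothetical `users` table and column names: the column your queries filter on gets an index, so lookups stay fast as long as that index fits in RAM.

```sql
-- Hypothetical schema: index the column your queries filter on.
CREATE TABLE users (
    id    serial PRIMARY KEY,
    email varchar(255) NOT NULL
);

-- A b-tree index on email; keep this small enough to stay cached in memory.
CREATE INDEX users_email_idx ON users (email);

-- This lookup can now be satisfied via the index rather than a table scan.
SELECT id FROM users WHERE email = 'a@example.com';
```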

~~~
steve
"keep their size under the memory allocated to postgres."

Assuming that you can do that, explain to me how this doesn't _totally_ wipe
away any scalability/performance advantage gained by using slow SQL databases.

Well, yes, I guess it depends what you're doing. A speed increase of ~100x and
increased simplicity made the crucial difference in my decision to drop the
database for most of its traditional uses in my app.

~~~
epi0Bauqu
I don't really understand your comment, but I'll try to restate more clearly,
and maybe that will answer it anyway. If anyone has a specific situation, I'd
be happy to speculate on it.

By keeping the indexes under the memory allocated to postgres, I mean the
total memory that can be set aside for the DB on the machine. If you look at
the pg data files, you will find the indexes are just large flat files, like
BerkeleyDB or whatever homegrown file-system thing you want to create.

So you do not need to increase the shared buffers to some crazy amount. In
fact, you will have a negative performance result if you do that. The kernel
will automatically cache these files upon repeated use. So if you have a
machine with 4GB of memory and you can allocate 3GB to the db, then these
indexes will just remain cached in memory forever. And you can easily get
relatively cheap machines now with 16GB and higher, so this gets you really
far.
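One way to check whether the indexes actually fit in the memory set aside for the DB. `pg_relation_size` and `pg_size_pretty` are standard PostgreSQL functions; the index name below is hypothetical.

```sql
-- Total size of all indexes in the current database (catalog query).
SELECT pg_size_pretty(sum(pg_relation_size(indexrelid)))
FROM pg_index;

-- Size of one particular (hypothetical) index.
SELECT pg_size_pretty(pg_relation_size('users_email_idx'));
```

If the totals come in under the RAM you can dedicate to the DB, the kernel's file cache will keep them hot, as described above.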

If you only run sql queries that are indexed appropriately and the indexes are
in memory, then I assure you postgres will not be the bottleneck in your app.
This is not hard to do; it just takes a little postgresql.conf tweaking and
data model forethought. In particular, you turn off sequential scans and
decrease the index tuple cost, then run EXPLAIN on all your SELECT queries to
ensure that the indexes are always being used. As for the data model, use the
smallest data types for indexes, and numeric types whenever possible. That
will keep the size down to a minimum. PostgreSQL takes care of the rest.
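The tweaks above, sketched as a postgresql.conf fragment. The parameter names (`enable_seqscan`, `cpu_index_tuple_cost`) are real PostgreSQL settings; the values are illustrative, not recommendations:

```
# postgresql.conf -- illustrative values only
enable_seqscan = off          # discourage the planner from sequential scans
cpu_index_tuple_cost = 0.001  # lower than the default, so index scans look cheaper
```

Then verify the plan for each query actually uses an index:

```sql
EXPLAIN SELECT id FROM users WHERE email = 'a@example.com';
-- Look for "Index Scan using ..." rather than "Seq Scan" in the output.
```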

Again, it depends on what you are doing, but this configuration will usually
tie or be significantly better than some home-grown file-system thing when you
get some scale because, like I said, postgres primarily uses the file system
memory cache as its memory cache. But if you have some file system process
that is accessing a ton of different files you will get an I/O bottleneck just
from the lookups alone. And if you use a big flat file, you might as well
stick with postgres because of its reliability, security, concurrency, network
and backup features.

That being said, I still wouldn't store big static blobs in the db. I would
just put them on another machine and let apache serve them up statically and
put references to them in the db (indexed by some number id). That way you get
the best of both worlds.
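A sketch of that reference-not-blob pattern, with hypothetical table and column names: the database stores only a numeric id and a path, and the web server serves the file itself.

```sql
-- Store a pointer to the file, not the file contents.
CREATE TABLE images (
    id   serial PRIMARY KEY,       -- numeric id referenced by the app
    path varchar(255) NOT NULL     -- location on the static-file host
);

-- The app looks up the path by id; Apache then serves the file directly.
SELECT path FROM images WHERE id = 42;
```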

~~~
joshwa
I'll answer my own question:

<http://www.slideshare.net/Blaine/scaling-twitter>

cache, partition, denormalize.

------
wensing
The strongest accusation I can find in this article is that "it still isn't
fool proof".

To what degree do 'real applications' have to be fool proof?

@joshwa: Thanks for pointing out the comments -- I'm glad I read them,
particularly this portion:

 _And even if the server software were to blow a gasket and call it quits w/o
any warning and w/o rebooting itself, and from a more realistic viewpoint,
even if the hardware were to fail, that still leaves the remaining slave
servers who are at least within a handful of bytes of being accurate at the
point the master server (if, in fact, it was the master server that gave up
the ghost) called it quits. In this regard I fail to see a "this isn't capable
of handling real-world applications" and instead "it's possible that something
bad could happen, so prepare for the worst yet hope for the best and know
that, within reason, it's far more likely that the best scenario (i.e. your
data layer remains persistent as it should) will be what actually takes place
99.9999999999% of the time._

Yeah, so as someone using EC2 + S3 for our web enterprise, I am not really
worried.

------
joshwa
good stuff in the comments there...

