
Good Advice on Keeping Your Database Simple and Fast - don420
http://www.allthingsdistributed.com/2009/03/keeping_your_database_simple_and_fast.html
======
CalmQuiet
Pet peeve: why use a _video_ to convey a couple simple suggestions (one of
which is "use S3" - not a huge surprise since this is from the blog of Amazon
CTO Werner Vogels)?

The other suggestion is more interesting, and one I have been contemplating several
ways of implementing: off-loading _very_ large (proportionally) text fields
into files (one raw; one html-tagged; etc) that are simply _linked_ in a db field.

Anybody else had a project where that was applicable? effective?

~~~
mixmax
Asking out of ignorance: if you move your large text fields out of the database
and into a file, how will that improve your performance? Yes, the DB will be
smaller and thus faster, but won't it take proportionally longer to load your
text from a file somewhere, thus nullifying your DB speedup?

~~~
gcv
So the primary reason for moving stuff out of the database in this particular
example is not performance, but storage constraints. Relational databases are
notorious for being exceedingly difficult to scale beyond a dataset which fits
on one server. If you have a giant blob stored in every row, you'll grow out
of the database's one-server storage capacity much more quickly than if you
take that blob and store it someplace else. If your data row instead includes
columns 'server' and 'path', you can easily distribute your blobs across
multiple servers.
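That pattern can be pictured with a minimal sketch (all table, column, and function names here are hypothetical; sqlite3 and temp directories stand in for a real RDBMS and real blob servers):

```python
import sqlite3, tempfile, os

# Stand-in blob store: one directory per "server". In production these
# would be separate machines or S3 buckets; here they are temp dirs.
stores = {"blob01": tempfile.mkdtemp(), "blob02": tempfile.mkdtemp()}

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE articles (
    id INTEGER PRIMARY KEY,
    title TEXT,
    server TEXT,   -- which blob server holds the body
    path TEXT      -- path/key of the blob on that server
)""")

def save_article(article_id, title, body):
    # Pick a store (round-robin by id) and write the large text there,
    # keeping only a small pointer in the database row.
    server = sorted(stores)[article_id % len(stores)]
    path = f"{article_id}.txt"
    with open(os.path.join(stores[server], path), "w") as f:
        f.write(body)
    db.execute("INSERT INTO articles VALUES (?, ?, ?, ?)",
               (article_id, title, server, path))

def load_article(article_id):
    # Follow the (server, path) pointer to fetch the body back.
    title, server, path = db.execute(
        "SELECT title, server, path FROM articles WHERE id = ?",
        (article_id,)).fetchone()
    with open(os.path.join(stores[server], path)) as f:
        return title, f.read()

save_article(1, "hello", "x" * 100_000)  # a 100 KB body never touches the DB
```

The database row stays tiny regardless of how large the text grows, and adding capacity means adding another entry to `stores` rather than growing the database server.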

In the example in the video, the team took the concept a step further and (1)
stored the blobs in S3, and (2) served them to the users straight out of S3.
That way, they pay for S3 storage, but not for their own machines for storing
or serving up those blobs.

~~~
trezor
_Relational databases are notorious for being exceedingly difficult to scale
beyond a dataset which fits on one server._

Factually incorrect. This may apply to MySQL, but that doesn't translate to
the rest of the RDBMS-world.

I wonder if I should start calling this point of view "MySQLism", as it seems
to be a theological dogma, with no basis in reality whatsoever, echoed
endlessly in the MySQL world.

For instance: I've managed database servers with a mere 8GB of RAM, with
hundreds of concurrent connections handling datasets of several hundred
gigabytes, and they ran just _fine_. The load wasn't even particularly nasty,
with CPU hovering at around 10% and no disk-queueing at all.

Of course, I chose to use something _better_ than MySQL, which might explain my
success.

~~~
gcv
That sounds terrific. Please provide more details about the setup you used.
Which database product? Does that product support splitting tables across
multiple servers? Does it let clients transparently query across all of them?
Which features did you enable? Did you shard the schema and then make the
application layer figure out which server to hit?

It sounds like you did not use a SAN or Oracle RAC or anything else which
requires exotic and expensive hardware. Please confirm this, and also, if you
used non-open-source software, could you provide a ballpark cost estimate?

~~~
trezor
The complete setup was a 2-node active/passive Microsoft SQL Server-cluster
solution with SAN backing, hosting various systems across 5 different SQL
Server instances.

For clarity and to avoid confusion, since Linux clusters and Windows clusters
are quite different: an active/passive Windows cluster does not distribute
load among nodes. A 500GB workload attached to one instance on one node will
only have that node to work on; the remaining node is only for failover.
So despite this being a 2-node cluster, it can easily be pictured as a 1-node
setup for all practical intents and purposes.

Also, SAN backing is required for Windows clusters, and hence can't be avoided.
For perspective, the amount of actual disks used in this setup was far from
extreme. I'm pretty sure we are mostly talking about a few LUNs running
something equivalent to RAID5 and RAID0 on regular 10k SAS disks. Nothing
fancy.

No schemas were sharded, as that completely breaks referential integrity, kills
the ability to do efficient joins, and is something essentially only done in
the MySQL world. No special features apart from indexing and filegroups were
enabled, but if disk IO or CPU had been a problem, you could very simply have
set up partitioned tables or distributed partitioned views between linked
servers to transparently involve more servers and distribute load.
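The partitioned-view idea can be pictured with a small sketch (names are hypothetical; sqlite3's `ATTACH` stands in for SQL Server linked servers): each member table holds one key range, and a `UNION ALL` view gives clients a single transparent table to query.

```python
import sqlite3, tempfile, os

tmp = tempfile.mkdtemp()
db = sqlite3.connect(":memory:")
# Each attached database stands in for a linked server holding one partition.
db.execute(f"ATTACH '{os.path.join(tmp, 'node1.db')}' AS node1")
db.execute(f"ATTACH '{os.path.join(tmp, 'node2.db')}' AS node2")

# Member tables: identical schema, each covering a disjoint key range,
# enforced by CHECK constraints.
db.execute("CREATE TABLE node1.orders (id INTEGER CHECK (id < 1000), total REAL)")
db.execute("CREATE TABLE node2.orders (id INTEGER CHECK (id >= 1000), total REAL)")

# The partitioned view: clients query 'orders' without knowing where rows live.
db.execute("""CREATE TEMP VIEW orders AS
    SELECT * FROM node1.orders
    UNION ALL
    SELECT * FROM node2.orders""")

db.execute("INSERT INTO node1.orders VALUES (1, 10.0)")
db.execute("INSERT INTO node2.orders VALUES (1000, 25.0)")

total = db.execute("SELECT SUM(total) FROM orders").fetchone()[0]
print(total)  # → 35.0
```

In SQL Server's real distributed partitioned views, the CHECK constraints on the key column let the query planner route each query to only the servers that can hold matching rows, which is how the load gets distributed without the application knowing about it.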

No front-end load-balancing was used, so no client redirection would be
necessary, although Windows clustering allows client redirection in case of
failover. No client configuration apart from the server's cluster name is
needed.

Products used: Microsoft Windows Server 2003 Enterprise Edition and SQL Server
2005 Standard Edition. Cost of these products will most likely vary depending
on what kind of existing licensing deal you have and where you live.

My estimate would be that the price of this setup is rather insignificant
compared to the amounts of data it can seamlessly handle without any developer
effort wasted on scaling. No effort whatsoever.

~~~
gcv
Interesting. So you had a SQL Server instance, paired with a failover backup
on hot standby (using Windows clustering), serving up data from a bunch of
disks attached together in a SAN, right? Which SAN products did you use, and
how much did they cost?

~~~
trezor
This was an HP EVA SAN solution with fibre-channel interfaces and hence somewhat
costly, but it was used for the entire organization (file clusters, VMware
clusters, Exchange clusters, etc.), not just the SQL clusters.

Apart from that teenie-weenie bit about having an officially supported
clustering setup, I see no reason why you couldn't have used some random Linux
iSCSI NAS solution, for instance powered by OpenFiler, which is 100% free.

