

Scale Cheaply - Sharding - dhotson
http://codebetter.com/blogs/karlseguin/archive/2008/06/30/scale-cheaply-sharding.aspx

======
mdasen
Sharding is a great way to scale, but it has problems - specifically, you
lose easy joins. With the example mentioned, you have a client where all
their stuff is on one box - think a hosted CMS. Wonderful!

This doesn't apply as neatly to something like Twitter. There, we have all our
users potentially needing to be joined to data on all our other users. Joins
don't go across shards, so you lose that with such a system. For something
like Twitter, you can either run lots of batch jobs and deal with the latency
that batching implies (not an option for a service that is used for instant,
small updates), use a single DB (replicated still counts as single in this
case), or use something like memcached to store a giant hash table of
the relationships and updates for the past 24 hours (with the database serving
only as long-term storage).

Sharding is awesome. Sharding has great uses. Sharding isn't something most of
us will have to use.
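The per-client routing described above ("all their stuff is on one box") might look like the following sketch, assuming hypothetical shard names; the point is that the shard is chosen from the client id, so within-client joins stay on one box while cross-client joins would have to touch every shard:

```python
import hashlib

# Hypothetical shard identifiers; in practice these would be
# connection strings or database handles.
SHARDS = ["db0", "db1", "db2", "db3"]

def shard_for(client_id: str) -> str:
    """Route a client to a shard by hashing its id.

    Every row belonging to this client lives on the returned shard,
    so joins scoped to one client never cross machines.
    """
    h = int(hashlib.md5(client_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]
```

Note the trade-off this makes concrete: a query that joins data across two clients may land on two different shards, which is exactly the lost-joins problem described above.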

~~~
paul
Twitter is no harder to partition/shard than Google is. (your search is
against the entire web, not just one part of it) Batching has nothing to do
with this issue.

------
DenisM
If you are sharding, I recommend a "distributed hash table"
<http://en.wikipedia.org/wiki/Distributed_hash_table> with a twist - make sure
each physical machine participates in the table many times (e.g. 1000 virtual
nodes). This way you can relocate small pieces of data to balance the load
without affecting other machines.
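The virtual-node twist can be sketched as a consistent hash ring (a sketch under my own class and method names, not any particular implementation): each machine appears many times on the ring, so adding or removing a machine moves only a small fraction of keys and leaves the rest in place.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes.

    Each physical machine is hashed onto the ring many times, so
    rebalancing moves only the keys in the affected slices instead
    of reshuffling everything.
    """

    def __init__(self, machines, vnodes=1000):
        ring = []
        for m in machines:
            for i in range(vnodes):
                ring.append((self._hash(f"{m}#{i}"), m))
        ring.sort()
        self._ring = ring
        self._keys = [h for h, _ in ring]  # sorted hashes for bisect

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def lookup(self, key):
        """Return the machine owning `key`: the first vnode at or
        after the key's hash, wrapping around the ring."""
        i = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[i][1]
```

With three machines at 1000 vnodes each, adding a fourth machine relocates roughly a quarter of the keys and leaves the rest untouched, which is the load-balancing property described above.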

