

Mr. Moore gets to punt on sharding - riklomas
http://www.37signals.com/svn/posts/1509-mr-moore-gets-to-punt-on-sharding

======
mdasen
I'm a little surprised that 37signals hadn't sharded some of their
applications. Sharding can be hard because you often don't have a logical way
to separate out data. For example, on a site like this, you could shard
articles based on year, but then if you wanted to get all the articles that a
single user had submitted, you'd have to query several servers.
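The cross-shard query problem described above amounts to a scatter-gather: when articles are split by year, collecting one user's submissions means asking every shard and merging the results. A minimal Python sketch, with in-process dicts standing in for per-year database shards (all names and data here are hypothetical):

```python
# Hypothetical sketch: articles sharded by year, so fetching one
# user's submissions means querying every shard and merging.

# Stand-ins for three per-year database shards.
SHARDS_BY_YEAR = {
    2007: [{"id": 1, "user": "pg", "title": "Post A"}],
    2008: [{"id": 2, "user": "mdasen", "title": "Post B"},
           {"id": 3, "user": "pg", "title": "Post C"}],
    2009: [{"id": 4, "user": "pg", "title": "Post D"}],
}

def articles_by_user(user):
    """Scatter the query to every shard, then gather and merge."""
    results = []
    for year, shard in SHARDS_BY_YEAR.items():  # one query per shard
        results.extend(a for a in shard if a["user"] == user)
    return sorted(results, key=lambda a: a["id"])

print([a["title"] for a in articles_by_user("pg")])  # → ['Post A', 'Post C', 'Post D']
```

The cost is exactly what the comment points out: one logical read becomes N physical queries, one per shard.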

However, the inherent design of 37signals applications makes sharding a lot
easier. With 37signals apps, you're not dealing with all the users in the
system or all of the data. You're dealing with one company's data and the
users from that company. You could much more easily shard by putting data from
companies with names A-I in DB1 and J-Z in DB2 and calling it a day. You
wouldn't have to worry about things like losing cross-database joins or having
to look up data in multiple databases for certain reads.

In fact, with Basecamp, there are six possible domains that your site can be a
subdomain of (projectpath.com, clientsection.com. . .). Each one of those could
point to a different set of app/DB servers. If you ran metrics on usage, you'd
have to run it 6 times and aggregate the data, but in terms of user usage,
each could be its own little separate system not even knowing the others
existed.

Sharding becomes a problem when you can't fit all the data that will be used
together on a single server. For example, Facebook could shard by college, but
you're allowed to have friends at different colleges from your own. So, you
have to do things like cross shard lookups and that mess. With 37signals
products (to my knowledge), everyone gets an account with your company's site
and interacts with your company's stuff and if they want to interact with
another company's stuff they log out and log back in with a different
username/pass. So, everything can be neatly sharded by company since you don't
have things that cross that boundary.

~~~
jhawk28
Not sure how one would shard the data based on a RoR app.

~~~
mdasen
Hypothetical: So, you have 26 companies using Basecamp (Company A-Company Z).
Your single database server can't handle all 26 companies so you decide to
shard them.

Your previous setup:

  Webhead1 \
            Database1
  Webhead2 /

So, you have two webheads reading from one database. You need to shard and
have a second database:

  Webhead1 -- Database1 (Companies A-M)
  Webhead2 -- Database2 (Companies N-Z)

Easy! Think of it this way: because a person at Company A is never accessing
Company Z's data, there doesn't need to be any communication between the
databases. So, it shards in the same way that WordPress shards: different
sites are on different servers.

With Basecamp, you don't need every company to be on the same server because
you don't move data between companies. But you don't need a separate server
for every company either. So, you can put a bunch of companies on the same
server and a bunch on another server and you're done. There's nothing magic or
RoR specific.
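The A-M / N-Z split above boils down to a small routing function: pick the database by the first letter of the account name. A minimal Python sketch (the shard table and connection names are hypothetical stand-ins for real DB handles):

```python
# Hypothetical sketch of range-based shard routing: companies A-M
# live on database1, companies N-Z on database2.

SHARDS = [
    ("A", "M", "database1"),   # (low letter, high letter, connection name)
    ("N", "Z", "database2"),
]

def shard_for(company):
    """Route a company to its shard by the first letter of its name."""
    first = company[0].upper()
    for low, high, db in SHARDS:
        if low <= first <= high:
            return db
    raise ValueError("no shard for %r" % company)

print(shard_for("Acme"))      # → database1
print(shard_for("Zendesk"))   # → database2
```

Because no request ever touches two companies, the router is the only cross-shard logic the app needs.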

------
smoody
The one thing to keep in mind with 37signals advice: I assume they are
building out and maintaining their own boxes (TaDa List being the one known
exception). Could be wrong about this, of course, but not all startups have
the luxury to do that. If one can't afford to rent or build out boxes stuffed
with lots of ram goodness and super-efficient raid setups (perhaps 37signals
has a dedicated highly optimized RAID-10 file server? that would also play
into their ability to run from a single server), then sharding becomes much
more important much earlier in the growth curve. I've always viewed sharding
as a poor-man's approach to scaling (with the exception being rich-man's
boundary condition sites such as flickr and facebook that have absolutely no
choice). And, of course, let's not forget that memcached can play an important
role for most people. The number of read queries they're executing might be
90% fewer if they're utilizing memcached properly.
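The memcached pattern being described — serve reads from cache, hit the database only on a miss — is a read-through cache. A minimal Python sketch, using an in-process dict as a stand-in for a real memcached client (a real app would use a client library and set an expiry):

```python
# Hypothetical read-through cache sketch. A plain dict stands in
# for memcached; db_query stands in for an expensive SQL read.

cache = {}
query_count = 0  # counts how often we actually hit the "database"

def db_query(key):
    global query_count
    query_count += 1
    return "row-for-%s" % key  # pretend this was an expensive SELECT

def cached_get(key):
    """Return from cache if present; otherwise query and populate."""
    if key in cache:
        return cache[key]      # cache hit: no database work
    value = db_query(key)      # cache miss: do the expensive read
    cache[key] = value         # ...and remember it for next time
    return value

for _ in range(10):
    cached_get("user:42")      # 10 reads, but only 1 database query
print(query_count)             # → 1
```

That 10-to-1 ratio is the mechanism behind the "90% fewer read queries" figure: repeated reads of hot rows never reach the database.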

------
andr
So Basecamp runs on a single database server?!

~~~
dhh
Sure does. There's a backup database that has a live replication feed, of
course, for availability. But for performance, we only need one.

I'd only run one application server too, if I could get away with it. Dealing
with 1 is much easier than dealing with many. Do it as long as you can!

------
trezor
_So as long as Moore’s law can give us capacity jumps like that, we can keep
the entire working set in memory and all will be good._

I'm not going to say "this guy does not know what he is talking about", but
what kind of shitty database are you running if you need to have the entire
database cached in memory?

Isn't the point of having a proper RDBMS that it should be able to handle data
more efficiently than what would be "logically" possible?

I've run databases with 100s of gigs of content on a 2-node cluster where each
node "only" had 8GBs of RAM, and neither performance nor memory usage was an
issue at all.

 _One point of real pain we’ve suffered, though, is that migrating a database
schema in MySQL on a huge table takes forever and a day._

Oh right. MySQL. That explains it.

~~~
chrisbolt
That's great if your data access is localized or infrequent enough that you
aren't exceeding your disks' I/O limits, or if you have lots of I/O capacity
by adding lots of spindles, but even adding spindles isn't going to be as good
as adding RAM. Sharding just makes each shard's local RAM cache more
efficient, and 37signals is choosing to have one 128GB server instead of 4
32GB shards, which is great as long as Moore's law keeps up with your growth.

~~~
trezor
My point was more that adding RAM isn't going to be as good as having an
actually good DB which doesn't hammer your disk and CPU with inefficient
lookup methods, employ overly broad locks, or fail at transactions.

His IO problem is because he is using a shitty database. Only in the MySQL
world is it common to need the same amount of RAM as the dataset you are
working on.

And if you are going that way memory-wise, you might as well use an in-memory
flat text file or persisted objects.

I'm not saying having excess RAM for caching is bad, because that is obviously
silly, but when you need to have the entire DB in RAM, you should realize
something is horribly wrong.

