

Things I wish we had known about scaling - martinkl
http://martin.kleppmann.com/2014/03/26/six-things-about-scaling.html

======
danudey
> In a database like PostgreSQL or MySQL, each client connection to the
> database is handled by a separate unix process

This is only partly right: PostgreSQL really does use a separate unix process
per connection, but MySQL handles connections with threads, and not
necessarily 1:1.

> Every connection adds overhead, so the entire database slows down, even if
> those connections aren’t actively processing queries.

I've never known this to be true, even in large production systems. Can anyone
cite a source?

> Partitioning (sharding) and read replicas probably won’t help you with your
> connection limit, unless you can somehow load-balance requests so that all
> the requests for a particular partition are handled by a particular server
> instance

Sharding and read replicas are two very different ways of handling data; the
issues cited as problems only affect sharding, not read replicas.

> That’s all doable, but it doesn’t seem a particularly valuable use of your
> time when you’re also trying to iterate on product features.

If you can't scale your database, then adding more functionality is a bad
thing. Software engineering doesn't stop once you make an API call to someone
else's software.

> in order to set up a new replica, you have to first lock the leader to stop
> all writes and take a consistent snapshot (which may take hours on a large
> database)

    mysqldump --single-transaction --master-data

You can even gzip this on-the-fly and stream it via SSH to the new server to
avoid disk I/O on the local machine competing, or even connect remotely from
another server via mysqldump to avoid the SSH overhead.
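
Assuming SSH key access to the new replica (host and database names here are
hypothetical), that pipeline might look like:

```shell
# Stream a consistent dump straight into the replica, compressed in flight,
# so nothing is written to the leader's local disk.
mysqldump --single-transaction --master-data mydb \
  | gzip \
  | ssh replica.example.com 'gunzip | mysql mydb'
```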

------
mysteriousllama
To add to this: Cache and cache invalidation.

Without proper caching and a good invalidation strategy your databases will
get pounded. Use Redis or memcached to cache everything possible. Don't even
connect to the database unless you have to. Ensure that you can invalidate any
cache entry easily, and keep operations atomic so you don't run into race
conditions. Use locking to ensure that when a cache entry expires, the
database doesn't get dog-piled with multiple copies of the same query. You'd
think the query cache in your database of choice would be just as efficient,
but trust me, it is not even close. You can also cache higher-level objects
than just simple queries.
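
The locking step can be sketched like this. This is a stand-in demo that uses
the filesystem instead of Redis/memcached (with Redis you'd use SET ... NX EX
for the lock); the key name and the "expensive query" are hypothetical:

```shell
#!/bin/sh
# Dog-pile prevention sketch: on a cache miss, exactly one worker rebuilds
# the entry while the others wait and retry the cache.
CACHE=/tmp/cache_user_42
LOCK=/tmp/cache_user_42.lock
rm -rf "$CACHE" "$LOCK"              # start cold for this demo

get_user_42() {
  if [ -f "$CACHE" ]; then
    cat "$CACHE"                     # cache hit: database never touched
  elif mkdir "$LOCK" 2>/dev/null; then
    # mkdir is atomic, so only one worker wins the lock and runs the
    # expensive query; everyone else falls through to the wait branch.
    echo "expensive-query-result" > "$CACHE"   # stand-in for the real query
    rmdir "$LOCK"
    cat "$CACHE"
  else
    sleep 1                          # another worker is rebuilding; wait
    cat "$CACHE" 2>/dev/null         # then retry the cache
  fi
}

get_user_42
```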

Depending on your reliability requirements you may even consider treating your
cache as writeback and doing batched database writes in the background. These
are generally more efficient than individual writes due to a variety of
factors.
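
A minimal sketch of that write-back idea, with hypothetical table and file
names (the actual mysql call is left commented out, so this only builds the
batch):

```shell
#!/bin/sh
# Write-back batching sketch: requests append to a spool instead of hitting
# the database, and a background job flushes the spool as one transaction.
SPOOL=/tmp/write_spool.sql
BATCH=/tmp/write_batch.sql
: > "$SPOOL"

queue_write() {
  # Cheap append per request; no database round-trip here.
  echo "INSERT INTO page_hits (path) VALUES ('$1');" >> "$SPOOL"
}

flush_writes() {
  # One transaction for the whole batch amortises fsync and network cost.
  { echo "BEGIN;"; cat "$SPOOL"; echo "COMMIT;"; } > "$BATCH"
  # mysql mydb < "$BATCH"           # uncomment against a real server
  : > "$SPOOL"                      # spool is empty again
}

queue_write /home
queue_write /about
flush_writes
```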

I've worked on several top-200 ranking sites and this has always been one of
the main go-to strategies for scaling. Databases suck - avoid querying them.

------
s0enke
Since MySQL 5.6, most schema changes (add/drop field) are possible online as
well:

http://dev.mysql.com/doc/refman/5.6/en/innodb-create-index-overview.html
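
For example (hypothetical table and column; the ALGORITHM and LOCK clauses
make MySQL fail fast if the change can't actually be done online):

```shell
mysql mydb -e "ALTER TABLE users ADD COLUMN last_login DATETIME, ALGORITHM=INPLACE, LOCK=NONE;"
```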

------
dalek2point3
"Database connections are a real limitation"

I repeat

"Database connections are a real limitation"

this should be one point, and the others should be one point. No?

