

Ask YC: Any hackers who can provide advice on scaling (esp database)? - shafqat

Particularly, how well symfony scales(using Propel ORM layer)? We've built our startup using Symfony, and it allows us to iterate quickly, but when it comes to scaling, especially database connections, are there any best practices out there that we should be aware of now?<p>I'd rather hear about people's experiences and tweak now rather than get screwed later! Thanks!
======
andr
Some things I figured while working on <http://thefeelgood.com>

1) Move as much of the work from MySQL to PHP. Things like joins, sorting,
etc. are done in PHP. Your PHP nodes are much easier to scale - just throw in
a new server when you need it. It won't fix all your scaling problems, but
will give you much more time before you need to scale your MySQL.

2) Use memcache for at least the most frequent queries, definitely for
everything on the homepage. If invalidation of something gets too hard just
use a 10-second expiration or so.

2a) To avoid dogpiling memcache, measure your peak rate of hits on the site.
Then, rewrite your cache get function like so:

    
    
      if (mt_rand(0,$peakRate) == 1)
          return false;
      else
          // get things from cache
    

3) Think about separating queries in two steps. The first just fetches a list
of IDs based on a certain condition, the second fetches the actual items.
Cache both steps, with individual cache entries for each item. Make sure you
use multi-get.

4) Try to design composite keys instead of autoincrement. For example, for The
Feel Good, the post ID is YYYY-MM-DD-U (U = user ID), so just by analyzing the
IDs in PHP, I can sort by date, filter future posts (you can schedule posts
for tomorrow), analyze user-to-user relationships, etc.

5) Don't be afraid to go around the ORM and write SQL by hand, especially for
tricky UPDATEs.

------
cmer
Your question's is a bit too broad, but let me tell you what I know. The first
question you need to ask yourself is: are you read-bound or write-bound?

A "generic" advice for databases is that you -don't- want to have multiple
small boxes when one big box will do. Managing many machines is complex both
from the sysadmin and developer points of view. Instead, get a machine with as
much ram as you can and disks as fast as possible (raid 0 really helps), cpu
doesn't matter too much for databases. Obviously, don't expect to run your DB
on a i486 ;-)

If your box can't cope with writes (disk too slow), you'll need to either
increase the write speed of your machine again (more disks in a raid 0 array)
or "shard" your data. You want to avoid that as much as you can, not fun.

If your box is read-bound (ie: is slow at reading), then you can start
thinking about replication to multiple machines.

But quite frankly, I think you got it all wrong. Don't bother about scaling
and improving performance now because it's almost impossible for you to know
where the bottleneck is going to be anyways.

I've never used Symfony, so I don't know anything particular about it, but I
don't think the framework is going to cause problems database-wise.

------
bap
I've worked a bit with Symfony and I've worked a lot with heavily loaded LAMP
deployments. Symfony/Propel (and I hear Drupal/Propel) tends to generate a
whole lot of select statements when you're allowing it to generate scaffolding
code.. So this leaves you with a couple options:

i) re-factor auto-gen code when it becomes a problem. A good code profiler
will really help you find these bottlenecks but you can also watch 'show
processlist' from a mysql CLI to see slow queries or table locking issues if
you're using MyISAM.

ii) Memcache - memcache caches query results and keeps the most heavily
requested results cached, lets less heavily requested data fade (it will go
back to the database to get it rather than keep a copy of the result around.)
obviously if you're dealing with a 'real time' application caching may not be
of huge benefit.

Jeremy Zawodny has a pretty good article here:
<http://jeremy.zawodny.com/blog/archives/000173.html>

About setting your wait timeouts and connection limits, etc. It's good stuff
and it's something I see overlooked very often. The default mysql config isn't
necessarily a your friend when it comes to scaling.

Other points above relating to 'spindle speed' when it comes to hard drive
performance etc, also apply. One other good rule of thumb is that MySQL loves
RAM. Add RAM and you generally add performance.

Hope that helps. :)

------
krishna2
I am assuming you are using MySQL. If so, there is a very nice mysql
performance blog. Definitely check that out. There is also a Mysql tuning book
by mysql press itself and then there is Mysql High performance Oreilly book as
well.

Plus, there are many simple rules to keep in mind as well. Use innodb when
transaction (acid-properties) is important, but MyISAM when reads needs to be
efficient. Normalize first and then denormalize (and duplicate data) later
when you hit a wall. From the beginning, start with the assumption that there
will be many databases and there will be many servers. There is also memcache
which is just plain awesome.

Lastly, seriously think when and where you really need a db and where you can
just use pickled-data/file-systems. [PG has said this many times as well].

Hope that helps.

------
shafqat
Sorry, we're using MySQL - should have specified. Thanks for all the tips so
far.

We're definitely exploring all caching options. Any suggestions for database
load balancing/building out the DB farm? What about database sizes - we're
probably adding 10,000 records a day. Thinking about using Amazon WS for
storing all old data.

------
tim2
Cache cache cache. Pre-cache, serve slightly stale cache, write your own data
structures to store cache data in special ways.

The obvious method is to use mem-cached, but beyond that, nothing seems to be
standardized. Use every trick you can think of.

------
nikiscevak
One word: Indexes. See what queries are taking the most time, and then add
indexes to them. Can speed up your app by 1000x if there is one query that is
running wild.

------
emmett
What database are you using?

