Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Ask YC: Any hackers who can provide advice on scaling (esp database)?
8 points by shafqat on March 24, 2008 | hide | past | favorite | 8 comments
Particularly, how well symfony scales(using Propel ORM layer)? We've built our startup using Symfony, and it allows us to iterate quickly, but when it comes to scaling, especially database connections, are there any best practices out there that we should be aware of now?

I'd rather hear about people's experiences and tweak now rather than get screwed later! Thanks!



Some things I figured while working on http://thefeelgood.com

1) Move as much of the work from MySQL to PHP. Things like joins, sorting, etc. are done in PHP. Your PHP nodes are much easier to scale - just throw in a new server when you need it. It won't fix all your scaling problems, but will give you much more time before you need to scale your MySQL.

2) Use memcache for at least the most frequent queries, definitely for everything on the homepage. If invalidation of something gets too hard just use a 10-second expiration or so.

2a) To avoid dogpiling memcache, measure your peak rate of hits on the site. Then, rewrite your cache get function like so:

  if (mt_rand(0,$peakRate) == 1)
      return false;
  else
      // get things from cache
3) Think about separating queries in two steps. The first just fetches a list of IDs based on a certain condition, the second fetches the actual items. Cache both steps, with individual cache entries for each item. Make sure you use multi-get.

4) Try to design composite keys instead of autoincrement. For example, for The Feel Good, the post ID is YYYY-MM-DD-U (U = user ID), so just by analyzing the IDs in PHP, I can sort by date, filter future posts (you can schedule posts for tomorrow), analyze user-to-user relationships, etc.

5) Don't be afraid to go around the ORM and write SQL by hand, especially for tricky UPDATEs.


Your question's is a bit too broad, but let me tell you what I know. The first question you need to ask yourself is: are you read-bound or write-bound?

A "generic" advice for databases is that you -don't- want to have multiple small boxes when one big box will do. Managing many machines is complex both from the sysadmin and developer points of view. Instead, get a machine with as much ram as you can and disks as fast as possible (raid 0 really helps), cpu doesn't matter too much for databases. Obviously, don't expect to run your DB on a i486 ;-)

If your box can't cope with writes (disk too slow), you'll need to either increase the write speed of your machine again (more disks in a raid 0 array) or "shard" your data. You want to avoid that as much as you can, not fun.

If your box is read-bound (ie: is slow at reading), then you can start thinking about replication to multiple machines.

But quite frankly, I think you got it all wrong. Don't bother about scaling and improving performance now because it's almost impossible for you to know where the bottleneck is going to be anyways.

I've never used Symfony, so I don't know anything particular about it, but I don't think the framework is going to cause problems database-wise.


I've worked a bit with Symfony and I've worked a lot with heavily loaded LAMP deployments. Symfony/Propel (and I hear Drupal/Propel) tends to generate a whole lot of select statements when you're allowing it to generate scaffolding code.. So this leaves you with a couple options:

i) re-factor auto-gen code when it becomes a problem. A good code profiler will really help you find these bottlenecks but you can also watch 'show processlist' from a mysql CLI to see slow queries or table locking issues if you're using MyISAM.

ii) Memcache - memcache caches query results and keeps the most heavily requested results cached, lets less heavily requested data fade (it will go back to the database to get it rather than keep a copy of the result around.) obviously if you're dealing with a 'real time' application caching may not be of huge benefit.

Jeremy Zawodny has a pretty good article here: http://jeremy.zawodny.com/blog/archives/000173.html

About setting your wait timeouts and connection limits, etc. It's good stuff and it's something I see overlooked very often. The default mysql config isn't necessarily a your friend when it comes to scaling.

Other points above relating to 'spindle speed' when it comes to hard drive performance etc, also apply. One other good rule of thumb is that MySQL loves RAM. Add RAM and you generally add performance.

Hope that helps. :)


I am assuming you are using MySQL. If so, there is a very nice mysql performance blog. Definitely check that out. There is also a Mysql tuning book by mysql press itself and then there is Mysql High performance Oreilly book as well.

Plus, there are many simple rules to keep in mind as well. Use innodb when transaction (acid-properties) is important, but MyISAM when reads needs to be efficient. Normalize first and then denormalize (and duplicate data) later when you hit a wall. From the beginning, start with the assumption that there will be many databases and there will be many servers. There is also memcache which is just plain awesome.

Lastly, seriously think when and where you really need a db and where you can just use pickled-data/file-systems. [PG has said this many times as well].

Hope that helps.


Sorry, we're using MySQL - should have specified. Thanks for all the tips so far.

We're definitely exploring all caching options. Any suggestions for database load balancing/building out the DB farm? What about database sizes - we're probably adding 10,000 records a day. Thinking about using Amazon WS for storing all old data.


Cache cache cache. Pre-cache, serve slightly stale cache, write your own data structures to store cache data in special ways.

The obvious method is to use mem-cached, but beyond that, nothing seems to be standardized. Use every trick you can think of.


One word: Indexes. See what queries are taking the most time, and then add indexes to them. Can speed up your app by 1000x if there is one query that is running wild.


What database are you using?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: