Surviving a traffic surge: Three techniques to scale your site fast (might.net)
91 points by RiderOfGiraffes on Mar 29, 2011 | 43 comments

These are decent tips, but the real fix for me when I got a (minor) traffic surge was changing KeepAliveTimeout from 15 (sec) to 2 (sec). Basically, due to the high default keep alive time for requests, most Apache threads were waiting for the timeout.

So the number of threads you have (e.g. setting StartServers, MaxSpareServers/MaxSpareThreads) is way less important than the keepalive timeout: you can/should start enough threads to use all the available resources, but it will only make a difference if you aren't idling all those threads with a high KeepAliveTimeout. Apparently, that's the Apache default setting.

Ding ding ding, we have a winner. KeepAlive can even kill your blog at 2 (achievement unlocked 4 times over last year).

Edit to elaborate:

Sorry, was eating dinner and Kindle is not exactly made for typing technical documentation. This comment is an abbreviated version of http://www.kalzumeus.com/2010/06/19/running-apache-on-a-memo... -- read that if you want a longer spiel. (It is my most cited blog post on HN. I don't know whether to be happy or sad about that.)

Basically, there are a couple of Apache MPMs available. You may have the prefork MPM installed. You can check by running "apache2 -l". If you see prefork in the output, take a look at your config file (quite possibly /etc/apache2/apache2.conf) and check for the setting "KeepAlive On". If KeepAlive is on, your blog is broken and you just haven't found the failure condition yet.

On my server (Ubuntu, has gone from Dapper to Lucid over the years), the package default Apache2 settings for the prefork MPM are: 15 second keepalive, 150 MaxClients. If your server has enough RAM to support 150 processes for Apache (and, if you're on a VPS, you probably don't), that will let you process a hard theoretical maximum of 600 clients per minute. There are many, many things you can do to exceed that maximum bound: getting on the front page of Reddit or getting retweeted by Jimmy Wales at the right hour of the day both qualify.

Calculation: any client requesting any file, regardless of whether it is dynamic, static, cached, generated by a PHP monstrosity, whatever, occupies one process for a hard minimum of 15 seconds. 4 clients saturate one process for 1 minute.

With special attention to fellow VPS owners: after having died hard several times when apache2 decided to use up all available RAM and then swap the machine to death, I eventually tweaked the MaxClients setting down to 24. This means that, even with KeepAlive at 2 seconds, my max throughput was 720 clients per minute. Again, that number is achievable under very plausible circumstances for a personal blog in 2010/2011.
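The two throughput figures above (600 and 720 clients per minute) follow from the same back-of-the-envelope model. A minimal sketch, assuming every client pins exactly one prefork process for the full KeepAliveTimeout and nothing else limits the server; the function name is made up for illustration:

```python
def max_clients_per_minute(max_clients: int, keepalive_timeout_s: float) -> float:
    """Rough upper bound on distinct clients served per minute under prefork.

    Assumption: each client occupies one process for at least the
    keepalive timeout, so one process turns over 60 / timeout clients
    per minute.
    """
    if keepalive_timeout_s <= 0:
        raise ValueError("this model needs a positive keepalive timeout")
    clients_per_process = 60 / keepalive_timeout_s
    return max_clients * clients_per_process


# Package defaults cited above: MaxClients 150, 15 second keepalive.
default_bound = max_clients_per_minute(150, 15)

# Tuned VPS settings cited above: MaxClients 24, 2 second keepalive.
tuned_bound = max_clients_per_minute(24, 2)

print(default_bound, tuned_bound)
```

With KeepAlive off the timeout term disappears and the bound is set by how long each request actually takes, which is why this model does not cover that case.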

There are a variety of countermeasures one can take against this. One is not using the prefork MPM, but you have to be a configuration Jedi to figure out how to actually do this and still run PHP on your server. "apt-get install apache2 libapache2-mod-php5", which is what substantially all guides will tell you to do, will force you to use the prefork MPM. If you had been using the worker MPM instead, you would have a much, much harder time crashing your server serving static content.

Another alternative: switch to Nginx. This problem goes away instantly. (If I didn't have 15 config files I would have to migrate, I would have done this years ago.)

The easiest alternative: turn off KeepAlive. This will give you a very modest throughput hit, but I'll trade "Blog stays up if mentioned in the NYT" for that hit any day of the week.

Do you run your site with keepAlive set low all the time or only when needed?

KeepAlive is off 100% of the time because "Wake up at 3 AM, in response to my cell phone playing Ride of the Valkyries because a server is offline, to tweak the config file locking out thousands of people who want to see my writing" is sensible precisely 0% of the time.

Wow, I was scared he was talking about scaling an application through most of that post, only to finally realize it mentions it's for a blog towards the end.

It's drastically oversimplified if you need to scale an application. "Step 2: Make content static". For an actual application, there is far more to say than 4 sentences.

In my opinion:

* There are few cases where Apache is better than nginx. I don't run PHP, so that may still be one of them.
* Varnish is awesome, use it, love it.
* Purely static blogs, like jekyll, are great.

Those tips are obviously worthwhile, but they don't address massive scale for heavily dynamic sites. They should definitely be at the beginning of any optimization checklist.

One thing I found interesting was this remark:

> Google Analytics failed to detect the surge: page load time was so high that visitors were closing the page before analytics could load.

He then remarks that it took Analytics 15 hours to detect the spike, but isn't that true of all Analytics instances? I'm not sure if the author is mistaking Google Analytics' delayed reporting as a fault or if I'm missing something.

These tips are for people getting slashdotted, not people building the next Google. Your average blog simply isn't "heavily dynamic", and if you're building Google/Facebook/... there are better resources.

Google Analytics is Javascript-based and thus vulnerable to people closing the page before it loads. It's actually near-realtime, but just doesn't work if you're having this problem. I'd imagine watching free Apache instances/threads, outgoing bandwidth or probably even the Apache logs would be more useful.
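One way to watch the Apache logs for a surge, as suggested above, is to bucket access-log lines by minute. A minimal sketch, assuming the log uses Apache's common/combined timestamp format (for example `[29/Mar/2011:14:05:09 +0000]`); the function name is invented for illustration:

```python
import re
from collections import Counter

# Captures the timestamp down to the minute, e.g. "29/Mar/2011:14:05".
# Assumes the common/combined log format; other LogFormat settings differ.
MINUTE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]")


def hits_per_minute(lines):
    """Count log lines per minute; lines without a timestamp are skipped."""
    counts = Counter()
    for line in lines:
        match = MINUTE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def busiest_minutes(lines, top=5):
    """Return the `top` busiest minutes as (minute, hits) pairs."""
    return hits_per_minute(lines).most_common(top)
```

Usage would be something like `busiest_minutes(open("/var/log/apache2/access.log"))`, where the path is the usual Debian/Ubuntu location and may differ on other systems. Unlike JavaScript-based analytics, this counts every request the server logged, bots included.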

> Google Analytics is Javascript-based and thus vulnerable to people closing the page before it loads.

And people with Javascript disabled, which is a startling number. The disparity between reported views in Analytics and what I can observe from my Web server logs is amazingly large.

Are you sure those aren't bots? Analytics packages vary widely in their ability to filter out bots.

Bots have Javascript disabled too. :)

You're right, of course. I considered it and started grepping them out, but got bored and did something else interesting. From a casual glance, there were a lot more Mozilla UAs than reported views, though.

The reason Google Analytics failed here has already been explained, but as a side note: GA Ecommerce tracking is near real-time, but not error corrected. Data in other parts can take several hours to show up; Google says to wait 24 hours for data to be complete and error corrected.

Omniture SiteCatalyst is near real-time. I don't know if they do some error correction later like GA does for Ecommerce data.

GA data is delayed by a day, but if you adjust the date range manually, you can see the data that has come in so far. Not guaranteed to be complete, but some data.

Google Analytics updates hourly. Reporting gives you only full days by default, but you can see data much more up to date than that.

600 MaxClients on a Linode box with 512MB RAM? At this point, it became quite clear that this article isn't going to be hugely useful.

These articles about "tuning" Apache are a pretty decent source of amusement, though.

What would you recommend instead?

Apache's memory usage should not exceed the available RAM on the machine. If you do that, it starts having to use swap, which drastically slows things down - if you're already getting lots of hits, it'll start a death spiral.

Apache's memory usage varies based on what modules are enabled and the code they're serving and a number of other factors.

The rule of thumb is to take the average free RAM when Apache isn't running, divide it by the average RAM usage of a single Apache process on your system, and set MaxClients to a couple under that value.

For example, on a 512MB Linode box, if you've got 450MB free when Apache isn't running, and Apache takes up 12MB per process, you'd allow about 35 at the most.
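The rule of thumb above is simple enough to write down. A minimal sketch, assuming "a couple under" means subtracting 2 from the number of processes that fit; the function name and the `headroom` parameter are illustrative, not anything Apache provides:

```python
import math


def suggested_max_clients(free_mb: float, per_process_mb: float, headroom: int = 2) -> int:
    """Estimate MaxClients from free RAM and average Apache process size.

    free_mb:        average free RAM with Apache stopped
    per_process_mb: average resident size of one Apache process
    headroom:       how many processes to leave spare (assumed to be 2)
    """
    if per_process_mb <= 0:
        raise ValueError("per-process memory must be positive")
    fits = math.floor(free_mb / per_process_mb)
    # Never suggest fewer than one process.
    return max(1, fits - headroom)


# The example above: 450MB free, 12MB per process.
print(suggested_max_clients(450, 12))
```

For the 512MB Linode example this gives 35, matching the figure quoted. Both inputs are averages, so it is worth re-measuring after enabling modules or changing the code Apache serves.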

Makes sense, thanks for clarifying!

Here is a good one: http://www.devside.net/articles/apache-performance-tuning

Was written a few years back for a Linode 128MB.

No. Stop it. Never ever scale your blog.

It sounds like the author was making the same mistake that pretty much everybody makes: Treating your blog as though it were dynamic content. But it's not. It's static HTML, and you should never have to make any modifications to anything to make it scale.

Step one: Have your blog export all entries to plain HTML.

Step two (optional): move your imagery out to S3/Cloudfront.

That's it. That will allow your little out-of-the-box slice to handle all the traffic that we can throw your way.

Scaling is an issue that you're meant to have with your product. Because your product actually needs to talk to databases and do things, it may have trouble doing those things when lots of people hit it at once. A website hosting a blog, on the other hand, needs to serve files. And that's been a solved problem for fifteen years.

The problem became unsolved in the interim, largely because it was virtually impossible for the 2001 Internet to overwhelm the limits of Apache's default settings, whereas it is fairly easy in 2011 to do that with social media.

There have been three posts on my blog this year which would, with absolute engineering certainty, have effectively DOSed Apache if I had kept the Ubuntu default settings. All involve numbers which are really small for computers, like 300k (hits in a day).

It took me years of blogging to realize why this happened and address it, despite my blog running on a beefy machine and me theoretically having experience with much harder problems than serving 20k of plain text repeatedly.

Have you considered simply moving your entire blog onto Cloudfront? It would take a lot of traffic to bring that down, and deploying to it is essentially as easy as deploying to a web server.

I'm afraid I'm going to have to stand by my assertion that static file hosting is a solved problem in 2011. I think the real issue we're seeing with all these "Slashdotted blogs" is that the database-based-blog is the 20 minute intro lesson for every new server-side tech. The result is that everybody thinks about blog hosting as a problem involving taking content from the database and displaying it to the user. This leads to things like caching and other performance hacks that could be done away with if you simply thought of the problem in terms of hosting files.

Kind of. This is all good advice, but WordPress is still the major blogging platform, and it is awful at scaling out of the box. You can get Super Cache or Total Cache but most people don't do that up front because they're not expecting the traffic, and even those won't do S3/Cloudfront for the static content in your site's themes.

If you're stuck with traffic to your blog, you should look at an existing caching solution and maybe scale your VPS a little. It's not likely to be a permanent thing.

If the blog supports comments, it needs to talk to databases.

> Step one: Have your blog export all entries to plain HTML.

He did that.

> A website hosting a blog, on the other hand, needs to serve files. And that's been a solved problem for fifteen years.

"Too few Apache threads" is a known problem, which he recognized as soon as he saw the load numbers.

> If the blog supports comments, it needs to talk to databases.

Sure, but it doesn't have to fall over. Outsourcing to Disqus is the easiest solution, but you can build your own AJAXy solution. Or just write out a new static file for each comment (if you're really overloaded, comments may take a while to be processed, but you can just serve the old page in the interim.)

"If the blog supports comments, it needs to talk to databases."

No it doesn't. Comments aren't added all that much relative to how much your site is getting hit. It would make more sense that the page is just a static html file with a form. The form submits to your app engine (php/python/whatever) which adds the data to the database. That then triggers your static html file to be rewritten to display that comment as well. The only other modification you "may" want to make is set it so that browsers don't cache your html page so they can see new comments if they refresh.
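The flow described above (form posts to the app, app stores the comment, app rewrites the static page) could look roughly like this. It is a sketch under stated assumptions, not anyone's actual setup: SQLite as the database, a single `comments` table, and every function name here are invented for illustration.

```python
import html
import os
import sqlite3
import tempfile


def init_db(db: sqlite3.Connection) -> None:
    """Create the (hypothetical) comments table if it does not exist."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS comments ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "post_id INTEGER NOT NULL, author TEXT NOT NULL, body TEXT NOT NULL)"
    )
    db.commit()


def render_page(db: sqlite3.Connection, post_id: int, post_html: str) -> str:
    """Build the full static page: post body, comments, and the comment form."""
    rows = db.execute(
        "SELECT author, body FROM comments WHERE post_id = ? ORDER BY id",
        (post_id,),
    ).fetchall()
    items = "\n".join(
        f"<li><b>{html.escape(author)}</b>: {html.escape(body)}</li>"
        for author, body in rows
    )
    return (
        "<html><body>\n"
        f"{post_html}\n"
        f"<ul>\n{items}\n</ul>\n"
        f'<form method="post" action="/comment/{post_id}">'
        '<input name="author"><textarea name="body"></textarea>'
        "<button>Post</button></form>\n"
        "</body></html>\n"
    )


def write_static(path: str, content: str) -> None:
    """Write to a temp file, then rename, so readers never see a partial page."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file owner-only
    os.replace(tmp_path, path)


def handle_comment(db, post_id, author, body, post_html, out_path):
    """Store one comment, then regenerate the static HTML for that post."""
    db.execute(
        "INSERT INTO comments (post_id, author, body) VALUES (?, ?, ?)",
        (post_id, author, body),
    )
    db.commit()
    write_static(out_path, render_page(db, post_id, post_html))
```

Only `handle_comment` touches the database, and it runs once per submitted comment; every page view is served as a plain file by the web server.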

> He did that.

Exactly my point. He did that and it worked.

If you have a blog that may one day see traffic, I'd recommend taking that bit of the post to heart and skipping the whole "serve blog content via the database" part altogether.

My number 1 tip: ditch Apache.

One of my web apps has the occasional spike in traffic that previously caused Apache to consume vast amounts of memory on my VPS, eventually crashing it due to lack of memory.

After reading many guides, experimenting, and generally getting quite frustrated (and working out what VPSs I could afford to upgrade to), I tried setting up Nginx on a separate port. It took maybe 1 hour for me to have my former LAMP stack set up and working, so I put it live, and haven't looked back since.

If you're on a VPS, use Nginx. The config file is wildly different to that of Apache, and you'll no doubt spend a few minutes cursing trying to figure out how to port over your rewrite rules, but after that it's plain sailing.

This. As an added bonus, I've also found that I prefer Nginx's configuration files.

Worst case, you can serve static files with Nginx and route dynamic requests to your Apache instance (I still do this with a few old PHP apps I have).

Using a reverse proxy (e.g. nginx's built-in one, or maybe Varnish) is a much easier and more dynamic solution than rendering things to static files.

Agreed. Heroku makes it dead simple to use their Varnish proxy. Just set one HTTP header and they'll cache the page for you. After that, your app doesn't do anything until the cache expires. This obviously won't work for highly dynamic pages, but for semi-static front pages/blog entries it can be a lifesaver.
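The comment above does not say which header is meant. HTTP caches such as Varnish commonly take their TTL from `Cache-Control`, so the sketch below assumes `Cache-Control: public, max-age=300`; the five-minute lifetime, the page body, and the `app` name are all illustrative.

```python
def app(environ, start_response):
    """Minimal WSGI app whose response a caching proxy may reuse.

    Assumption: the proxy in front honors Cache-Control max-age, so for
    300 seconds after the first hit it can answer without calling this.
    """
    body = b"<html><body><h1>Front page</h1></body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # "public" allows shared caches to store the response.
        ("Cache-Control", "public, max-age=300"),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` will serve it, though without a proxy in front nothing is actually cached. Pages that vary per user or set cookies need more care than this.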

I'm not at all familiar with Linode's iPhone app - but the bottom graph looks an awful lot like a system load graph, and not a CPU utilization graph.

"CPU utilization never exceeded 3%"? Really? Maybe the system load was at 3.0 for a few days?

It is CPU utilization. Linodes have access to four cores so the graph's scale is from 0% to 400%, which might not necessarily synchronize with top. Same information as the Linode Manager, just a different time scale to fit constraints of the iPhone screen. The same graph on my personal Linode, from the Linode Manager, has a legend that identifies it as a percentage.


That number comes from outside of the Linode, not within it. The 14-day graph is also a 2-hour average, which means each value is over a particularly large period and big spikes will be smoothed out; I don't think the results are too far from the real world here, though. Not a lot of CPU time is required to fetch things from various places and transmit over the network...it's mostly wait time.

Completely agree, that's a system load graph, not CPU usage.

No, it's CPU usage, not a load average.

> With amazon's EC2 service, I'll be able to deploy as many temporary mirrors as I need in just a few minutes

Can you go into detail on how that works? I thought you had to have your site originally hosted on EC2 to do that ...

I'm curious about this too. I'm imagining he adds a redirect to the EC2 instance in an .htaccess file for the page that is getting hit hard?

>>(Step 1: Cut image quality) Page load time dropped from 24 to 12 seconds.

Wow. I had no idea that could make such a difference. I suppose the issue was with the low number of threads set in the Apache configuration. The server was spending its time sending out static content when it could have been doing more important things? I signed up with S3 to serve up my static content to keep that load off my server. I should probably be using CloudFront instead though.

I'm interested in looking for a backup host just in case. I currently use WebFaction and I love them to death -- but I'm worried that under incredible stress the shared hosting won't hold. With Linode, do you start from scratch with a blank OS and just install everything you need from there (Apache, mod_wsgi, etc. kind of thing) or do they have preset installs? With WebFaction I can select a particular setup and I'm up and running in minutes.

Linode offers StackScripts, which can automate a lot of the initial deployment steps. http://www.linode.com/stackscripts/

One [probably naive] thing that I've done in the past has been to use mod_rewrite to redirect people to a static version of a page.

Something like this would work in .htaccess:

RewriteRule ^t/item/4372/$ /static/4372.html

It's saying "Hey, apache, if you see somebody asking you for website.tld/t/item/4372/, send them to website.tld/static/4372.html instead"

A blog post I wrote got about 100k hits in a day a few weeks ago, and using mod_rewrite in this fashion, I was able to keep the site running for the entire day.
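To make the mapping in that rule concrete, here is the same URL-to-file translation written out in Python. It is only an emulation for illustration: the capture group that generalizes the rule to any numeric item ID is an addition, not part of the rule above, and `rewrite` is a made-up name. In a per-directory (.htaccess) context the pattern is matched without the leading slash, which the sketch mirrors.

```python
import re

# Generalized form of the rule above: any numeric item ID, not just 4372.
# (Assumption: item IDs are purely numeric.)
ITEM_RULE = re.compile(r"^t/item/(\d+)/$")


def rewrite(path: str):
    """Return the static file path for a matching request, else None.

    `path` is the request path without its leading slash, as mod_rewrite
    sees it in .htaccess context.
    """
    match = ITEM_RULE.match(path)
    if match:
        return f"/static/{match.group(1)}.html"
    return None


print(rewrite("t/item/4372/"))
print(rewrite("about/"))
```

The first call maps to `/static/4372.html`; the second does not match and returns `None`, meaning the request would fall through to the normal dynamic handler.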

Nice post. It would be nice to read a comparison between Apache and nginx in these cases.

Where the hell have I been, didn't realize Linode has an iPhone app.
