

Ask HN: How many servers do I need to handle 100 mln hits per month? - adserverboy

Hi guys,

I'm trying to build a server farm of commodity machines to handle web traffic for a free public-information government site.

The expected load is anywhere from 50 million to 100 million hits per month.

The site runs on commodity servers with up to 4GB of RAM each, on a LAMP stack (plus memcached), and in our tests (under medium load) it processes requests in about 0.3 seconds.

Is there a rule of thumb that can help me pseudo-reliably predict how many servers I will need to handle up to 100 million impressions, or even 1 billion?

Has anyone done something similar?

Thanks in advance.
======
asb
I think you're really missing something by looking at just the total number of
hits over a month. How do you expect them to be distributed? What do you
expect the peak rate to be? How slow is the site allowed to be at that peak?
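As a back-of-envelope sketch of why peak rate matters more than the monthly total: a minimal calculation, where the 5x peak-to-average factor is purely an assumption (real traffic to a government site can spike much harder around deadlines or news events).

```python
# Translate monthly hits into an assumed peak request rate.
# The peak_factor of 5 is a guess, not a measurement.

SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.6 million seconds

def peak_rps(hits_per_month: float, peak_factor: float = 5.0) -> float:
    """Average requests/second scaled by an assumed peak-to-average ratio."""
    avg = hits_per_month / SECONDS_PER_MONTH
    return avg * peak_factor

print(peak_rps(100_000_000))    # 100M hits/month -> ~39 req/s avg, ~193 at 5x
print(peak_rps(1_000_000_000))  # the 1 billion case -> ~1930 req/s at 5x
```

With a different assumed peak factor the answer changes proportionally, which is exactly why the distribution question has to be answered first.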

~~~
grandalf
This is the key question: you have to build for your peak access rate.

Also, are you replacing an existing service with a known traffic pattern? If
not, how are you coming up with your estimate?

Is there any way you can dial back the service a bit if load exceeds capacity
-- maybe scheduling the cron jobs for every hour instead of every 15 minutes?
If so, you might want to start with a smaller number of servers and provision
additional ones as you become sure of the load.

Also, why not use ec2, slicehost, or some approach that lets you grow (or
shrink) your infrastructure as needed?

~~~
adserverboy
thanks grandalf.

We are consolidating a number of services from different departments into one
single access point.

We are also open to using ec2. The question on my mind is how to estimate
how many ec2 instances (small or large), and consequently what budget,
we need to support the expected traffic.

~~~
grandalf
Well, with ec2 you can fire up an instance for a few hours and do some load
testing.

I recommend creating a load test with characteristics that mimic what you
think you'll see in production and then just run apachebench or httperf from
another ec2 instance.

You might even want to include in your code some calls to sleep, some slow
loops, or even code that forces more frequent cache misses; all of this can
help you generate a more conservative estimate.
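If you'd rather script the load test than drive apachebench by hand, a minimal sketch of the same idea in Python (the URL and thread/request counts are placeholders -- point it at a staging instance, never production):

```python
# Minimal load-generator sketch in the spirit of apachebench/httperf:
# N worker threads hammer one URL; we report achieved throughput.
import time
import threading
import urllib.request

def worker(url: str, n_requests: int, latencies: list) -> None:
    for _ in range(n_requests):
        start = time.perf_counter()
        try:
            urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue  # count only successful requests
        latencies.append(time.perf_counter() - start)

def load_test(url: str, threads: int = 20, requests_each: int = 50):
    latencies: list = []
    pool = [threading.Thread(target=worker, args=(url, requests_each, latencies))
            for _ in range(threads)]
    start = time.perf_counter()
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    elapsed = time.perf_counter() - start
    # (achieved requests/second, list of per-request latencies)
    return len(latencies) / elapsed, latencies
```

apachebench or httperf will give you better-calibrated numbers; the point of a scripted version is that you can mimic your real request mix rather than one static URL.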

These are just a few ideas. I think you could probably come up with a
reasonable guess at costs this way, all for a few hours of ec2 time.

A lot depends on your load and how memory- vs CPU- vs IO-intensive it is, and
that will affect which sort of load balancing you end up choosing. I suppose
even if you don't end up going with ec2, it would still provide some useful
benchmark results.
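Once you have a per-server throughput number from the load test, the server count falls out of simple division. A rough sketch, assuming CPU-bound requests at the 0.3 s figure from the original post; the worker count and headroom factor below are placeholders to be replaced with measured values:

```python
# Rough capacity math: how many boxes for a given peak request rate?
# Assumes each request holds a worker for ~0.3 s (from the OP's tests)
# and each box runs a fixed pool of Apache/PHP workers.
import math

def servers_needed(peak_rps: float, service_time_s: float = 0.3,
                   workers_per_server: int = 20, headroom: float = 0.7) -> int:
    per_worker = 1.0 / service_time_s                        # ~3.3 req/s each
    per_server = per_worker * workers_per_server * headroom  # don't plan for 100%
    return math.ceil(peak_rps / per_server)

print(servers_needed(200))    # peak from the ~100M hits/month estimate
print(servers_needed(2000))   # the 1 billion/month case
```

The headroom factor matters: a box that benchmarks at X req/s will degrade badly if you actually run it at X, so planning at 60-70% utilization is the usual practice.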

~~~
adserverboy
thanks grandalf.

I'll give it a go.

Also found this site - <http://highscalability.com/>

Very good resource.

------
brk
What does a "hit" consist of?

A static text page?

A static text page with a few images?

How many of the hits will be from (and/or NEED to be from) database-driven
queries?

How much of a "hit" can be cached?

Are visitors typically repeat (where browser-cached objects could be of
benefit) or new (where you'll need to look more at CDN caching)?

~~~
adserverboy
Thanks for the quick reply.

A hit is approximately 2 KB of data, built by making on average 4 memcached
calls per request.

Database access is rare.

We have two background cron jobs (per server) that run every 15 minutes.

Cron A - checks for dirty entities and invalidates them in memcached.

Cron B - extracts traffic logs from memcache and stores them in a remote
database.

Thanks

~~~
peterhi
You might want to look at how your cache keys are created; if they are
designed correctly, memcached will expire entries automatically, removing the
need for a cron.

Of course how much of a saving this will be is another question.
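One common way to get the "automatic" invalidation described above is to put a version number in the cache key: writers bump the version, and stale entries are simply never read again and fall out when their TTL expires. A sketch of the pattern, using a plain dict as a stand-in for a real memcached client (whose set() call would take a TTL argument):

```python
# Versioned cache keys: invalidate by bumping a version, not by deleting.
cache = {}  # stand-in for memcached; real entries would carry a TTL

def version_key(entity: str) -> str:
    return f"v:{entity}"

def cache_key(entity: str) -> str:
    # current key for an entity, e.g. "report:1", "report:2", ...
    version = cache.setdefault(version_key(entity), 1)
    return f"{entity}:{version}"

def invalidate(entity: str) -> None:
    # bump the version; old "entity:<n>" keys expire on their TTL
    cache[version_key(entity)] = cache.get(version_key(entity), 1) + 1

cache[cache_key("report")] = "rendered page"
invalidate("report")
assert cache.get(cache_key("report")) is None  # new key, so a clean miss
```

This trades a little extra memory (dead entries linger until TTL) for dropping the invalidation cron entirely; the traffic-log cron (Cron B) would still be needed.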

