

Server Power-ups That Worked For Our Startup: Handling 10x Traffic - robert_mygengo
http://mygengo.com/talk/blog/server-power-ups-that-worked-for-us-handling-10x-traffic/

======
dolinsky
Having been there a few times there are a few suggestions I'd like to put out
there. The points here should take a very minimal amount of time (< 1 week
combined) but they are meant to be easy ways of getting more mileage out of
what you currently have.

#1. Move all of the non-db specific crons off of your db servers (those that
don't specifically involve the rotation of logs/cleanup of tmp directories)
and onto one of your web heads. There is nothing more painful than seeing your
master db crash because of a cron job or other non-essential role on your db
server. It's much easier to recover from / spin up a new web head if it
crashes than to restart mysql after your master db crashes (a slave db is
easier, but still more of a pain point than a web head). If at all possible,
since you're already on EC2, spin up a worker server just to perform these
cron jobs (stats collection, db cleanup scripts, etc). EC2 affords you this
kind of flexibility. Take it.
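As a rough sketch, the crontab on that dedicated worker instance might look
something like this (the script paths and schedules are entirely illustrative):

```
# /etc/cron.d/worker -- lives on the EC2 worker box, not the db master
*/5 * * * *  app  /opt/app/bin/collect_stats.sh
0 3 * * *    app  /opt/app/bin/db_cleanup.py
```

leaving the db master's own crontab with nothing but log rotation and tmp
cleanup.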

#2. Hooking into a CDN is very easy to do and can allow your web servers to
offload the serving of static files. As an easy way of doing cache-busting, set
up a global version variable in your code that gets injected into each static
url that's generated (so a request for cdn.mygengo.com/images/29384/bar.jpg
would be rewritten to look for /images/bar.jpg on your web heads if and only
if the cdn'd cache had either expired or the unique # had changed in the
request).
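A minimal sketch of that versioned-URL scheme in Python (the helper name,
version value, and exact placement of the version segment are illustrative;
the only requirement is that the CDN/origin rewrite strips the segment the
same way you inject it):

```python
ASSET_VERSION = "v42"            # bumped on every deploy (illustrative value)
CDN_HOST = "cdn.mygengo.com"

def static_url(path, version=ASSET_VERSION):
    """Prefix the asset path with a version segment so the CDN URL changes
    whenever the version is bumped; the CDN (or a rewrite rule on the web
    heads) strips the segment to find the real file."""
    return "http://%s/%s/%s" % (CDN_HOST, version, path.lstrip("/"))
```

Bump ASSET_VERSION on each deploy and every static URL changes at once, so
stale CDN copies are never served to users.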

#3. Set up some sort of performance monitoring if you haven't already. Almost
every startup's #1 pain point is the db server, specifically queries. It's
great that you've been able to see such a vast improvement through adding
indexes / cleaning up some queries, but as long as your data is growing you
will need to keep on top of your query logs, as their impact footprint will
change as the size of your data does. Since you're on AWS, enabling CloudWatch
should be very easy and provides the basics. You can then move on to
implementing Nagios/Munin or look to something like CloudWatch/ServerDensity
for an in-the-cloud solution.

You also didn't mention anything about having a K/V system like Memcache or
Redis (yes @antirez we all know Redis rocks and how dare I place it after
Memcache ;) which can really take the load off your db servers.
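For anyone unfamiliar, the usual way those K/V stores take load off the db is
the cache-aside pattern. A toy sketch in Python, with a plain dict standing in
for Memcache/Redis and a made-up db query:

```python
import time

cache = {}            # stand-in for Memcache/Redis: key -> (expires_at, value)
TTL_SECONDS = 300

db_calls = []         # tracks how often the "db" is actually hit

def fetch_user_from_db(user_id):
    # hypothetical expensive query; real code would hit MySQL here
    db_calls.append(user_id)
    return {"id": user_id}

def get_user(user_id):
    """Cache-aside read: check the K/V store first, fall back to the db
    and populate the cache on a miss so repeat reads skip the db."""
    key = "user:%d" % user_id
    now = time.time()
    entry = cache.get(key)
    if entry is not None and entry[0] > now:
        return entry[1]                      # cache hit: db untouched
    value = fetch_user_from_db(user_id)      # cache miss: one db query
    cache[key] = (now + TTL_SECONDS, value)
    return value
```

With a real store you'd swap the dict for memcached/redis calls, but the
read-through-then-populate shape stays the same.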

Best of luck with continued growth.

------
cagenut
You can get away with following oral(blog)-history scaling lore at that low
end, but before you scale much further the most important thing by far is to
be more metrics and measurement oriented. The most basic ones to start with
are average http response time and http requests per second.

For instance, say your slicehost returned a page at an average of 200ms, but
only below 10 req/second; over that it started to bottleneck somewhere and
responses would shoot up to 500+ms. Then your question is "how do I handle 100
http requests per second while keeping the average at or below 200ms".

Eventually you'll need to step up to full-browser-render (gomez, keynote,
browsermob, homegrown-selenium) time, Apdex or 95th-percentile based numbers
instead of simple averages, and backend components like mysql req/s and
service time.
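As a sketch, here's how those basic numbers fall out of one window of
per-request timings (the sample data is made up):

```python
import math

def response_stats(times_ms, window_seconds):
    """Average and 95th-percentile response time (ms) plus requests per
    second over a single measurement window."""
    ordered = sorted(times_ms)
    n = len(ordered)
    avg = sum(ordered) / float(n)
    # nearest-rank method: the ceil(0.95 * n)-th smallest sample
    p95 = ordered[int(math.ceil(0.95 * n)) - 1]
    rps = n / float(window_seconds)
    return avg, p95, rps

# 20 requests over 2 seconds: mostly fast, with two slow outliers.
avg, p95, rps = response_stats([100] * 18 + [500, 900], 2.0)
```

For that sample the average is 160ms while the 95th percentile is 500ms,
which is exactly the tail behaviour a simple average hides.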

The point is, this is engineering, use numbers.

------
edanm
Thanks for the writeup. Hope to have these problems myself someday soon!

Fwiw, one pain point for me has been using EC2. I started off with EC2 from
the start, so that I could learn the system now and scale up whenever I need.

Unfortunately, it seems that the performance of EC2 instances is _terrible_.
Or at least, that's what it looks like for me. After a few weeks of trying to
get performance boosts by tweaking code, I decided to try a small Slicehost
slice, which performed much better at running the exact same system.

My friend (who recommended Slicehost) did the same test on Large EC2 instances
as well, and reported that they _also_ appeared oversubscribed.

------
holdupadam
I haven't really had to scale a system up like this before. As a freelance web
developer, moving to a bigger slice on Linode has been good enough for any
WordPress blog or simple site.

That said, I didn't realize what a massive undertaking it could be, as this
writeup describes.

Anyone had any similar scaling challenges with less conventional web stacks?
In particular, I'm wondering how a node.js/mongodb stack would handle this
same thing.

~~~
moe
I have scaled many systems in the past and the easy answer is: Know what
you're doing or hire someone who does.

There is no blanket answer because applications and workloads differ. E.g.
MongoDB has a number of documented issues under high load (esp. high write
load), in some scenarios they can be worked around with reasonable trade-offs,
in some they can't and Mongo needs to be replaced.

If there's one general thing to say about scaling, it's that you scale
_applications_, not stacks. The stack will usually change in the process.

~~~
fvbock
> If there's one general thing to say about scaling, it's that you scale
> applications, not stacks.

very true! which is why i think it usually is a good idea to analyze how you
could split up your application into more or less discrete
subsystems/"services". that way it becomes easier to react to bottlenecks in
the subsystems instead of having to operate on the whole system.

ideally you design your system like that in the beginning but usually you grow
and find out the harder way... step by step.

still there are a couple of things that are "always (about) the same": for
example, session management or logging/sitestats usually put quite a bit
more write load (INS/UPD/DEL) on a system than other parts of the system do.

so as your product grows, the write stress these subsystems/functionalities
put on the system on the one hand, and the read load from other subsystems on
the other, make it hard to optimize your db/storage systems, because they
have quite contrary I/O needs.

if you have separate subsystems that run as discrete services with their own
API it becomes much easier to put out a fire that breaks out due to growth or
load spikes.

so to add to the comment by moe: "you scale applications" - and you get into a
better position to do so if you design your applications so that they are
composed of separate subsystems that communicate via APIs.

------
wildmXranat
| _In some cases we managed to speed up integral queries by over 8000%! That's
both an indication of the growing pains_

That is an indication of sloppy db planning and bad SQL use. You should test
with a data-set whose size and type resembles the live data-set you'd have if
you become successful. Yikes.

~~~
dolinsky
While I share your 'hype' skepticism, that gain could easily come from adding
a missing index. By itself the above statement really doesn't say much, either
about the amazing improvement or about the quality of the current schema in
place.
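For illustration, here's the missing-index effect demonstrated with Python's
built-in sqlite3 (table and column names are made up; on MySQL you'd read the
same story out of EXPLAIN):

```python
import sqlite3

# In-memory table with a column we filter on but never indexed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO jobs (status) VALUES (?)",
                 [("pending" if i % 10 else "done",) for i in range(1000)])

query = "SELECT id FROM jobs WHERE status = ?"

# Without an index the planner must scan every row on each lookup.
plan_before = str(conn.execute("EXPLAIN QUERY PLAN " + query, ("done",)).fetchall())

# One CREATE INDEX later, the same query becomes an index search.
conn.execute("CREATE INDEX idx_jobs_status ON jobs (status)")
plan_after = str(conn.execute("EXPLAIN QUERY PLAN " + query, ("done",)).fetchall())
```

The first plan reports a full table scan; the second reports a search using
idx_jobs_status, which is the whole 8000% story in miniature once the table
gets large.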

------
po
This is a great writeup Robert. Thanks for the overview.

What were the primary factors that signaled it was time to move to AWS for you
guys? What else were you considering at the time?

------
Schmelson
Good topic, but of course if you were starting now you would just begin on EC2
and be done with it.

~~~
moe
Depends on the goals. They didn't say how much their "10x traffic" actually
is, but by the sound of it they could probably save money by moving to
dedicated hardware.

------
adam0101
Text gets cut off in mobile safari.

~~~
holdupadam
Thanks for the heads up - this will be fixed in a few hours.

------
chirish
This is a remarkably sane and bullshit-free description.

~~~
holdupadam
Ya, but that comment is not bullshit-free.

