
Ask HN: How does one gain expertise in scalability? - groaner
My experience is mostly in experimental software and prototypes that never made it into production.  I have some vague ideas of techniques that are used to scale applications and have read materials such as that on http://highscalability.com, but no practical experience in this area, nor any projects where I can apply any of this information.<p>For those of you who are knowledgeable about scalability, how did you develop that knowledge?
======
buro9
When you need to scale, you'll find you'll be able to.

The question is probably more, "What can you do today to make scaling less
painful when you need to, without wasting effort now?"

The answer to that is to monitor everything.

Scaling is about being able to spot the bottleneck in the system and to
remove it or work around it.

Unfortunately that's a hard thing to do. You need to be aware of everything:
networking layers, hardware, software, how things work (or don't work)
together. You've got to be able to see the entire application holistically and
understand how one part impacts the rest.

Only obsessive monitoring can help you there, unless you're a bit of a savant
and can visualise the whole system.

When you identify a bottleneck, you re-design and remove it.

How to make that easier? I find it helps to think of large systems as a
network of components that each do "one thing well". It makes identifying
bottlenecks easier: there can be no doubt about where a bottleneck is if a
machine (or its network) is running hot and it really does just a single
task.

So the only advice I give people in advance of them having pain points is not
to mix responsibilities within a part of the system. If you have a database
server, make sure that is the only thing that server does.

For anyone trying to learn how to scale: the way to scale differs from system
to system, so few general prescriptions exist. That said, a lot of
highscalability.com is general enough for web apps that you can build up an
arsenal of approaches to common problems and turn to them when you spot
bottlenecks emerging.

Ultimately: you can't learn how to scale until you have to, but you can design
your system so that each part does one thing well and is obsessively
monitored... at least then, when you need to scale, you're in a good position
to react to that demand.

~~~
buro9
I failed to give you a starting point on monitoring; try these:

<http://graphite.wikidot.com/> (I prefer this)

<http://munin-monitoring.org/> (widely used and fairly easy to find resources
for)

Don't just monitor general things like CPU and I/O; monitor what your
application does as well. The point of monitoring is to give you the ability
to connect symptoms to their cause.

Knowing that CPU spiked shortly after the network became saturated isn't as
helpful as knowing that your calls were requesting a large number of records
and returning more data than needed. If you fail to monitor your actual
application too, you only have half of the picture.
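To make the application-metrics point concrete: Graphite (linked above) accepts
metrics over a simple plaintext protocol, one `name value timestamp` line per
metric, sent to its carbon listener (port 2003 by default). This is a minimal
sketch, not from the original post; the metric name and the `localhost:2003`
endpoint are illustrative assumptions.

```python
import socket
import time

def format_metric(name, value, timestamp=None):
    """Render one metric in Graphite's plaintext protocol: 'name value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {timestamp}\n"

def send_metric(name, value, host="localhost", port=2003):
    """Send a single metric line to a carbon plaintext listener."""
    line = format_metric(name, value)
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))

# Example: record how many records one API call returned, alongside the
# general CPU/IO metrics your host agent already collects:
# send_metric("app.search.records_returned", 1450)
```

With application-level counters like this next to the system-level ones, the
"large requests saturated the network" diagnosis above falls out of one graph
instead of guesswork.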

------
patio11
This is less about scaling and more about any random technical knowledge you
might need:

1) Companies with Serious Scalability Needs have a lot of organic knowledge in
them, which you gain both a) by being mentored by senior team members and b)
by baptism-by-fire.

2) Companies with Serious Scalability Needs have training budgets larger than
YC's total amount invested to date. (Just a convenient ballpark figure for
"Holy cow, that's real money!") They spend it to e.g. send engineers to
conferences where people with organic knowledge of the subject will teach you
best practices (i.e. Here's how we got burned, here's what resulted in less
burnination). For example, my ex-day-job dropped ~$20k to send me and one
other engineer to JavaOne, where we heard e.g. the chief architect from
LinkedIn talk about the sort of caching strategy you need to make writes
appear instantly consistent to the user who initiated them while achieving
eventual consistency across the many thousands of other users they fan out
to.

3) Largely as a result of the conferences in #2, there are presentations and,
more rarely, videos online produced by people who do this sort of thing for a
living. You found HighScalability.com , so that is good. I have learned a
_tremendous_ amount of not-covered-in-college-or-personal-experience technical
data by consuming SlideShare presentations and following the output of people
who routinely produce good stuff on particular topics. (For example, not
exactly scaling, but the YSlow team created _huge_ amounts of value for me by
presenting on the how and why of web-page performance optimization several
times.)

4) People who know this stuff are available to teach it to you, if you have
tens of thousands of dollars to spend on it. If you don't have tens of
thousands of dollars to spend, you probably don't _really_ have scalability
problems.

5) Most startups have the scaling problem "We have no scaling problems!" For
the rest, there's a lot of "Smart people plus challenges plus lots of
headaches = we pushed our personal skill levels a step forward."

------
snewman
There's an old saying attributed to Michael A. Jackson (no, not _that_ Michael
Jackson):

    
    
      The First and Second Rules of Program Optimisation
      
      1. Don’t do it.
      2. (For experts only!): Don’t do it yet.
    

Scaling is similar. If you worry about scaling before you're at scale, you're
almost certain to spend most of your effort fixing things that won't actually
turn out to be bottlenecks. So, as a few other posters have said, wait until
you have scale problems and then address them.

A safer variant of this is to make a copy of your system and hit it with a
simulated load. However, it's hard to do this in a usefully realistic way.
Usually, you'll be launching at small scale and growing gradually. In that
situation, it's best to launch, observe actual traffic patterns, and model
your simulated load on them. To recap:

1\. Build your application, following rule 1 (don't worry about scale).

2\. Launch.

3\. Observe your actual traffic, and use it to build a realistic load
generator. (Depending on your application, you may be able to simply grab a
day's worth of logs and replay them at high speed.)

4\. Run a copy of your system and hit it with 10x your actual load. See what
breaks; fix; repeat.
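Step 3's "replay a day's logs at high speed" idea can be sketched as follows.
This is a minimal illustration, not a production tool: it assumes access logs
in common log format, replays only read traffic (replaying writes safely needs
more care), and the base URL is a placeholder you'd point at your test copy.

```python
import re
import urllib.request

# Match the request portion of a common-log-format line; GET/HEAD only,
# since replaying writes against a test system needs separate handling.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

def parse_paths(log_lines):
    """Extract request paths for the read requests in a batch of log lines."""
    paths = []
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m:
            paths.append(m.group(1))
    return paths

def replay(paths, base_url):
    """Fire the recorded requests back-to-back ('at high speed')."""
    for path in paths:
        urllib.request.urlopen(base_url + path, timeout=5).read()

sample = [
    '1.2.3.4 - - [01/Jan/2011:00:00:01] "GET /index.html HTTP/1.1" 200 512',
    '1.2.3.4 - - [01/Jan/2011:00:00:02] "POST /login HTTP/1.1" 302 0',
]
# parse_paths(sample) keeps only the GET; feed the result to replay()
# against a copy of your system, scaled up toward 10x by duplication.
```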

That said, here's one tip that will help a lot. A stateless system is easy to
scale -- just run more copies of it. Of course, most interesting systems
aren't stateless. But often you can push the state into a database. Then
you're writing a stateless server that sits on top of a database. Now your
code is easy to scale. You're left with the problem of scaling the database,
so: at the outset, choose a database that will scale to your needs. This is
nontrivial, but is much easier than scaling your own custom stateful code.
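The "push the state into a database" pattern can be sketched as a handler that
keeps nothing in process memory, so any copy of it can serve any request. A
minimal sketch using sqlite3 (via a shared in-memory database) as a stand-in
for a real shared database; the class and table names are illustrative, not
from the original post.

```python
import sqlite3

# A shared in-memory database stands in for a real database server here.
DB_URI = "file:scaledemo?mode=memory&cache=shared"

class StatelessCounter:
    """Holds only a connection; all state lives in the database, so N copies
    of this 'server' behind a load balancer are interchangeable."""

    def __init__(self):
        self.db = sqlite3.connect(DB_URI, uri=True)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS hits (page TEXT PRIMARY KEY, n INTEGER)")
        self.db.commit()

    def record_hit(self, page):
        cur = self.db.execute("UPDATE hits SET n = n + 1 WHERE page = ?", (page,))
        if cur.rowcount == 0:
            self.db.execute("INSERT INTO hits (page, n) VALUES (?, 1)", (page,))
        self.db.commit()

    def hits(self, page):
        row = self.db.execute(
            "SELECT n FROM hits WHERE page = ?", (page,)).fetchone()
        return row[0] if row else 0

# Two "servers" sharing one database behave identically:
a = StatelessCounter()
b = StatelessCounter()
a.record_hit("/home")
b.record_hit("/home")
```

Because neither instance caches anything, you can add or kill copies freely;
all the hard scaling work is concentrated in the database, as described above.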

~~~
mjb
> This is nontrivial, but is much easier than scaling your own custom stateful
> code.

Yes, this is an important piece - and it disagrees somewhat with your "don't
think about scaling yet" point. Probably the most important part of designing
for scale is separating the design into stateless, soft-state, and
durable-state pieces. The stateless pieces are trivially easy to scale (with
enough money), either by improving efficiency, using bigger machines, or
using more machines. Soft state (read caches, write-through caches, etc.) is
a little harder, but still won't need a huge amount of care.

The difficult piece is durable state storage (typically a database). When you
are just starting out, a big database of everything is probably good enough.
Choose a solid, widely used data store (MySQL, Postgres, Mongo, Oracle, MSSQL,
etc) and use it. Do as little in this layer as you can - it's going to be the
most expensive and difficult to scale. Put your business logic elsewhere.
Protect it from read spikes with a cache. Design your schemas carefully.

Depending on your read/write mix, data volume and the requirements of your
application, you can get pretty big (1000s of reads per second, 10s of writes)
without any special database hardware or knowledge.

Things to remember:

1) Avoid tight coupling between stateless and durable state parts of your
application. There is nothing wrong with running your DB on the same box as
your web server when you are small, but don't write code that assumes that
architecture.

2) Choose your data model well. Think carefully about the Nouns in your
business, and the relationships between those Nouns (much as you would in OO
design). Choose the interfaces and relationships between nouns carefully to
reduce coupling. Try to keep interfaces clean.

3) Measure, don't assume. Your page loads slowly? Don't throw out Apache and
replace it with Nginx. Measure what is taking the time, and concentrate on the
slow piece. If your actual web server is slowing you down, then change
servers. Most often, though, slowness is going to be either in your database
or your application code.

4) Worry about interfaces, objects and design. Don't worry about technology.
The latest buzzwords will not save you from bad design practices - and bigger
hardware will only save you for so long.
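Point 3 ("measure, don't assume") can start as simply as timing the layers
separately before blaming any of them. A minimal sketch; the function names
and the sleep standing in for a query are illustrative.

```python
import time
from functools import wraps

timings = {}

def timed(name):
    """Accumulate wall-clock time per labelled layer, so you can see whether
    the database, the app code, or the web server actually dominates."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)
        return wrapper
    return decorator

@timed("db")
def fetch_rows():
    time.sleep(0.01)  # stand-in for a real query
    return ["row"]

@timed("render")
def render(rows):
    return "".join(rows)

page = render(fetch_rows())
# Inspect `timings` before swapping Apache for Nginx: if "db" dwarfs
# everything else, the web server was never the problem.
```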

~~~
einhverfr
_The difficult piece is durable state storage (typically a database). When you
are just starting out, a big database of everything is probably good enough.
Choose a solid, widely used data store (MySQL, Postgres, Mongo, Oracle, MSSQL,
etc) and use it. Do as little in this layer as you can - it's going to be the
most expensive and difficult to scale. Put your business logic elsewhere.
Protect it from read spikes with a cache. Design your schemas carefully._

I disagree with some of this. In general, the worst db scaling I have seen
has been in systems with large numbers of simple queries, built on the idea
of doing as little in the db as you can. Instead I would suggest two
principles for making the db a little more scaling-friendly:

1) Everything that needs to be queried together should be queried together.
Don't do lots of round trips and simple queries.

2) Don't do stuff in your database that it isn't designed to do. Write good
queries, but don't do things like send emails from the db backend.

A corollary here is that you should write your queries with performance in
mind but not do too much premature optimization. For example, it's a lot
easier to go from a group by to a sparse index scan (using a stored proc or a
CTE) than it is to lock yourself into a sparse index scan from the get go.
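Principle 1 is essentially the classic N+1 problem. A minimal sqlite3 sketch
(table and column names are illustrative) contrasting many round trips with
one joined query; over a network, each extra trip adds a full round of
latency even when every individual query is cheap.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

def totals_n_plus_one():
    """One query for the users, then one more per user: N+1 round trips."""
    out = {}
    for uid, name in db.execute("SELECT id, name FROM users"):
        (total,) = db.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (uid,)).fetchone()
        out[name] = total
    return out

def totals_joined():
    """Everything that needs to be queried together, queried together."""
    return dict(db.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id, u.name
    """))
```

Both return the same answer; the joined form does it in one trip and lets the
database plan the whole access at once.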

In general, though, the four points you mention are extremely well thought
out. Following up on #4: although technology is important sometimes (esp.
newly maturing technologies like Postgres-xc), it is going to be far more
useful where an app is well designed than where it is not. After all, if you
don't know where the bottlenecks are in your app, you can't put effort into
the right places to fix them.

------
ook
I have helped scale (non web) low latency systems in some pretty stressful
situations.

In addition to the sage advice about monitoring / metrics, mentoring and not
scaling until necessary I think the following are useful:

* Design Services not Software. In particular read "On Designing and Deploying Internet-Scale Services" (<http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf>) and at least the first chapter of "The Art of Unix Programming" (<http://catb.org/~esr/writings/taoup/html/>)

* Get to grips with debugging and profiling so you can figure out what's really happening. Tools like sar, sysstat, [d|k|s]trace, tcpdump, gdb etc and the equivalents for your datastore & application frameworks are invaluable and unfortunately for whatever reason you inevitably won't have all the metrics and monitors you need.

* Do try to understand every layer of your service. I have helped debug scale out related issues from Layer 2 to Layer 8. I have also had to debug many Layer 1 issues while bringing up a new Site or similar. I may not be a DBA, Network Engineer or Software Engineer but in the past I have had to wear those hats while scaling.

* Despite comments elsewhere about learning through Mentoring and baptism-by-fire there is a lot of real engineering & science theory you can lean on. Looking back on courses I took in school while I didn't do any courses on Scalable Web Programming over the years I have used content from courses on Computer Architecture, Math including Queuing Theory and Statistics, Systems Programming (OS & Network).
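As a small example of the queuing-theory point: the textbook M/M/1 model
predicts how response time explodes as utilisation approaches 1, which is
exactly the non-linear slowdown you see when a server nears saturation. A
sketch under the model's assumptions (Poisson arrivals, exponential service
times), which real traffic only approximates:

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean time in an M/M/1 system: W = 1 / (mu - lambda), valid for lambda < mu."""
    if arrival_rate >= service_rate:
        raise ValueError("system is unstable: utilisation >= 1")
    return 1.0 / (service_rate - arrival_rate)

# A server that can handle 100 req/s: mean latency vs offered load.
latencies = {load: mm1_response_time(load, 100) for load in (50, 80, 95, 99)}
# 50% load gives 20 ms; 99% load gives a full second. Going from 95% to 99%
# utilisation quintuples latency, which is why "the box is only at 95%"
# is not a comforting statement.
```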

~~~
ook
I just found these slides from a MSC Module at an Irish university on
"Software in Production" which are well worth reading through if you're tasked
with scaling systems and have a Software Development background:

<http://www.maths.tcd.ie/~niallm/day[1-5].pdf>

------
cheald
I can think of two ways. One, work under someone with that experience, solving
those sorts of problems. The other, which is far more thrilling/terrifying, is
to run into scaling problems, and have to figure out how to fix them. There's
no teacher like experience.

As others have mentioned, though, the first step is to monitor and log
everything. You want metrics on what people are doing, what your server is
doing, what your network stack is doing, and everything in between. The more
data you have, the more tools you have to a) locate, and b) fix problems. You
don't know if something has improved if you don't have a baseline to measure
it against!

Some of the tools I use include Mixpanel/Google Analytics (for understanding
user behavior), Newrelic (for server-side runtime metrics), MongoDB profiling
and the MySQL slow query logs (for database sticky spots), and Munin (for
server/network monitoring). You'll want to build your own toolset depending on
the needs of your network, but once you've figured out how to collect the
data, you're a lot further than you think.

------
michaeldhopkins
One way to practice is to DDoS your own server until you figure out how to
withstand it, then steadily increase the load. In fact, several services help
you do this and provide some nice statistics too.

~~~
mjb
Be careful with load testing.

When your business/community/etc is 10x the size it is today, it's unlikely
that usage patterns and use cases will be the same. Rather, they are likely to
shift as you grow.

Simply firing 10x today's load at the system is one way to find out today's
bottlenecks, but may be ineffective at finding tomorrow's bottlenecks. That
doesn't mean it's not worth doing, but it needs to be done with care.

------
biesnecker
In my (limited) experience, it's the same way you become an expert in
parenting -- do it, learn from everything that goes wrong, and incrementally
do it better.

There are basic principles you can learn, but every service (child) is
different, and you'll have to take that basic general knowledge and figure out
how to apply it in your specific case.

------
ciscoriordan
Run into scaling problems.

Seriously though, here are the lecture notes for Stanford's CS193S, Scalable
Web Programming:

[http://www.stanford.edu/class/cs193s/syllabus_and_slides.htm...](http://www.stanford.edu/class/cs193s/syllabus_and_slides.html)

------
genieyclo
<http://news.ycombinator.com/item?id=2249789>

------
trussi
Build and install a scaled-out application: multiple AWS instances in a
load-balanced configuration. Use small instances to purposely bottleneck the
available resource pool.

Use something like Blitz.io to beat the hell out of the application. This is
your baseline.

Use performance monitoring and optimization tools to improve the architecture
and application. Go through each layer of the application to see where the
biggest gains can be found. Also look/test for data consistency.

Rinse and repeat with blitz.io.

Also, test how well the architecture responds to various types of server
failures.

And make sure you test a viable backup process as well.

Lastly, once you have a nice, performant architecture, increase system
resources both up (more resources per server) and out (more servers) to see
how well the system actually scales. :)
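The baseline step can also be approximated locally when a hosted tool like
Blitz.io isn't to hand. A minimal sketch of a concurrent load generator:
`request_fn` is any callable that performs one request (the stub below stands
in for a real HTTP call, which is an assumption of this sketch), and the
summary is the set of numbers worth writing down as your baseline before each
rinse-and-repeat cycle.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load(request_fn, total_requests, concurrency):
    """Fire total_requests calls with `concurrency` workers; return per-request
    latencies in seconds."""
    def one(_):
        start = time.perf_counter()
        request_fn()
        return time.perf_counter() - start
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(one, range(total_requests)))

def summarize(latencies):
    """The baseline numbers: median, tail, and worst case."""
    ordered = sorted(latencies)
    return {
        "count": len(ordered),
        "p50": ordered[len(ordered) // 2],
        "p95": ordered[int(len(ordered) * 0.95)],
        "max": ordered[-1],
    }

# Stub standing in for a real request against the load-balanced setup:
stats = summarize(run_load(lambda: time.sleep(0.001),
                           total_requests=50, concurrency=10))
```

Re-running the same load after each optimisation pass gives you a direct
before/after comparison instead of a gut feeling.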

~~~
armandososa

        Build and install a scaled out application. Multiple AWS instances in a load balanced configuration. Use small instances to purposely bottleneck the available resource pool.
    

But where do you learn how to install a load balancer? I know I can google
that particular example. But as with learning anything new, the big problem is
when you don't know what you don't know.

------
HowardRoark
There are different levels of scaling, and for different kinds of apps the
trade-offs you can make are different. Unless you are making a really huge
leap, most of the time it will be a gradual process. So unless you are
talking Google or Twitter scale in a matter of months, it may not be as bad
as it sounds.

Monitor, optimize your current configurations, cache, load balance, cluster,
use messaging, use existing industry knowledge on distributed systems, read
academic papers and innovate... For advanced distributed systems, check out
Prof. Indranil Gupta's lectures, among many others:
<http://www.cs.uiuc.edu/class/fa10/cs425/lectures.html>

------
rkalla
I find some of the most exciting tidbits about scalability come from reading
the mailing lists of popular data stores whose users are ramping up to huge
deployments (MongoDB and Cassandra primarily, but Redis and CouchDB
occasionally).

All those people with multi-TB systems scaled across many data centers
inevitably end up on those lists asking great questions about exactly the
types of problems you run into at that level.

It helps give you perspective before _you_ are the one with a geographically
dispersed collection of terabytes of data, which can be a godsend.

------
soho33
from personal experience, you don't gain expertise in scalability until it's
too late: the website is down and you are running around trying to get it
back up!!!

when i first coded the website, i took some shortcuts to get the product out
there quickly. once we launched, the traffic started growing and i noticed
the slow response times, so i went back to basics, went through all my pages,
and tried to optimize as much as i could. a lot of it i had to learn on the
fly: anything from SQL query optimization and playing with database indexes
to speed up searches, to implementing Memcache, and finally separating and
load-balancing the database and web server. you can prepare yourself for many
of these issues, but the best way is to fall right in the middle of it when
your site goes down and you need to get it back up!!

i'm sure everyone who was in the same position at one point would agree that
even though the site is down and you have to go through and optimize a lot of
code, it's GREAT FEELING knowing that people are actually visiting your site
and using your creation!

------
mblakele
Good stuff already. But you don't necessarily have to start by testing to
destruction. Do some hammock design. Think about your tech stack. What do you
think will be the limiting factors under typical user load? The next step is
to simulate that load, and address any failures.

------
iradik
go work at a company and team that does software at scale.

------
foobarbazetc
Design for 10x scale to start with.

Wait until your site goes down due to unanticipated bottlenecks.

Fix bottlenecks.

Repeat.

~~~
chii
So essentially, you learn by failing? That used to get you killed in the
caveman age, and these days it tends to get you fired.

These are the sorts of things you have to learn via second-hand experience, i
reckon.

~~~
rbrcurtis
it's not quite that bad, really. unless you are scaling amazingly fast, like
say twitter did, usually you get site slowness that you can troubleshoot and
fix, not a completely busted website. my suggestion is to read some
books/websites on the subject, and make sure you architect early to
accommodate scaling later.

