Ask HN: How does one gain expertise in scalability?
49 points by groaner on Dec 5, 2011 | 26 comments
My experience is mostly in experimental software and prototypes that never made it into production. I have some vague ideas of techniques that are used to scale applications and have read materials such as those on http://highscalability.com, but I have no practical experience in this area, nor any projects where I can apply any of this information.

For those of you who are knowledgeable about scalability, how did you develop that knowledge?

When you need to scale, you'll find you'll be able to.

The question is probably more, "What can I do today to make scaling less painful when I need to, without wasting effort today?"

The answer to that is to monitor everything.

Scaling is about being able to spot the bottleneck in the system and to remove it or work around it.

Unfortunately that's a hard thing to do; you need to be aware of everything: networking layers, hardware, software, how things work (or don't work) together. You've got to be able to see the entire application holistically and understand how one part impacts the rest.

Only obsessive monitoring can help you there, unless you're a bit of a savant and can visualise the whole system.

When you identify a bottleneck, you re-design and remove it.

How to make that easier? I find thinking of large systems in terms of a network of systems that each do "one thing well" helps. It makes identifying bottlenecks easier: there's no doubt where a bottleneck is if a machine (or its network) is running hot and it really does just a single task.

So the only advice I give people in advance of them having pain points is not to mix responsibilities within a part of the system. If you have a database server, make sure that is the only thing that server does.

For anyone trying to learn how to scale: the way to scale differs from system to system, so few general rules transfer. That said, a lot of highscalability.com is general enough for web apps that you can build up an arsenal of approaches to common problems and turn to those when you spot bottlenecks emerging.

Ultimately: you can't learn how to scale until you have to, but you can design your system such that each part does one thing well and is obsessively monitored... at least then, when you need to, you're in a good place to react to that demand.

I haven't given you a starting point on monitoring yet; try these:

http://graphite.wikidot.com/ (I prefer this)

http://munin-monitoring.org/ (widely used and fairly easy to find resources for)

Don't just monitor general things like CPU and I/O... monitor what your application does as well. The point of monitoring is to give you the ability to connect symptoms to their cause.

Knowing that CPU spiked shortly after networking became saturated isn't as helpful as knowing that your calls were requesting a large number of records and returning more data than needed. So if you fail to monitor your actual application too, you only have half of the picture.
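To make "monitor what your application does" concrete, here's a minimal sketch of pushing an application-level metric to Graphite using Carbon's plaintext protocol (one `path value timestamp` line per metric, sent to TCP port 2003 by default). The metric name `app.search.rows_returned` and the host/port are assumptions for illustration.

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Format one metric in Carbon's plaintext protocol: 'path value timestamp\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="localhost", port=2003):
    """Push a single application-level metric to a Graphite/Carbon listener."""
    line = format_metric(path, value)
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))

# Example: record how many rows a query returned, not just CPU/IO.
# send_metric("app.search.rows_returned", 1432)
```

With a metric like that graphed next to CPU and network, the "requests returning more data than needed" cause above becomes visible instead of inferred.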

This is less about scaling and more about any random technical knowledge you might need:

1) Companies with Serious Scalability Needs have a lot of organic knowledge in them, which you gain both a) by being mentored by senior team members and b) by baptism-by-fire.

2) Companies with Serious Scalability Needs have training budgets larger than YC's total amount invested to date. (Just a convenient ballpark figure for "Holy cow, that's real money!") They spend it to e.g. send engineers to conferences where people with organic knowledge of the subject will teach you best practices (i.e. here's how we got burned, here's what resulted in less burnination). For example, an ex-day-job dropped ~$20k to send me and one other engineer to JavaOne, where we heard e.g. the chief architect from LinkedIn talk about what sort of caching strategy you need to make writes appear instantly consistent to the user who initiated them, while achieving eventual consistency across the many thousands of other users they fan out to.

3) Largely as a result of the conferences in #2, there are presentations and, more rarely, videos online produced by people who do this sort of thing for a living. You found HighScalability.com, so that is good. I have learned a tremendous amount of not-covered-in-college-or-personal-experience technical data by consuming SlideShare presentations and following the output of people who routinely produce good stuff on particular topics. (For example, not exactly scaling, but the YSlow team created huge amounts of value for me by presenting on the how and why of web-page performance optimization several times.)

4) People who know this stuff are available to teach it to you, if you have tens of thousands of dollars to spend on it. If you don't have tens of thousands of dollars to spend, you probably don't really have scalability problems.

5) Most startups have the scaling problem "We have no scaling problems!" For the rest, there's a lot of "Smart people plus challenges plus lots of headaches = we pushed our personal skill levels a step forward."

There's an old saying attributed to Michael A. Jackson (no, not that Michael Jackson):

  The First and Second Rules of Program Optimisation
  1. Don’t do it.
  2. (For experts only!): Don’t do it yet.
Scaling is similar. If you worry about scaling before you're at scale, you're almost certain to spend most of your effort fixing things that won't actually turn out to be bottlenecks. So, as a few other posters have said, wait until you have scale problems and then address them.

A safer variant of this is to make a copy of your system and hit it with a simulated load. However, it's hard to do this in a usefully realistic way. Usually, you'll be launching at small scale and growing gradually. In such situations, it's best to launch, observe actual traffic patterns, and model your simulated load on that. To recap:

1. Build your application, following rule 1 (don't worry about scale).

2. Launch.

3. Observe your actual traffic, and use it to build a realistic load generator. (Depending on your application, you may be able to simply grab a day's worth of logs and replay them at high speed.)

4. Run a copy of your system and hit it with 10x your actual load. See what breaks; fix; repeat.
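Step 3's "grab a day's worth of logs and replay them" can be sketched roughly like this, assuming Apache-style Common Log Format access logs and replaying only idempotent GETs; the concurrency knob is a crude stand-in for "at high speed":

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Matches the request line in a Common Log Format entry, e.g.
# 1.2.3.4 - - [05/Dec/2011:10:00:00 +0000] "GET /search?q=hn HTTP/1.1" 200 512
LOG_RE = re.compile(r'"(?P<method>GET|HEAD) (?P<path>\S+) HTTP/[\d.]+"')

def paths_from_log(lines):
    """Extract replayable GET/HEAD paths from access-log lines."""
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            yield m.group("path")

def replay(base_url, paths, concurrency=50):
    """Fire the recorded requests at a copy of the system, many at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for path in paths:
            pool.submit(urlopen, base_url + path)
```

A real replay tool would also preserve request timing and cover POSTs, but even this crude version exercises your system with your traffic's actual URL mix, which synthetic benchmarks won't.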

That said, here's one tip that will help a lot. A stateless system is easy to scale -- just run more copies of it. Of course, most interesting systems aren't stateless. But often you can push the state into a database. Then you're writing a stateless server that sits on top of a database. Now your code is easy to scale. You're left with the problem of scaling the database, so: at the outset, choose a database that will scale to your needs. This is nontrivial, but is much easier than scaling your own custom stateful code.
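A hypothetical sketch of the "stateless server on top of a database" shape, using a made-up visit counter and sqlite3 as a stand-in for your shared database. The point is that the handler holds no in-process state, so any number of identical copies can serve any request:

```python
import sqlite3

def bump_and_get(db_path, page):
    """Handler logic with no in-process state: all state lives in the DB,
    so this same code can run on one server or fifty."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS hits (page TEXT PRIMARY KEY, n INTEGER)")
        conn.execute("INSERT OR IGNORE INTO hits VALUES (?, 0)", (page,))
        conn.execute("UPDATE hits SET n = n + 1 WHERE page = ?", (page,))
        (n,) = conn.execute(
            "SELECT n FROM hits WHERE page = ?", (page,)).fetchone()
    conn.close()
    return n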

> This is nontrivial, but is much easier than scaling your own custom stateful code.

Yes, this is an important piece - and it disagrees somewhat with your "don't think about scaling yet" point. Probably the most important part of designing for scale is separating the design into stateless, soft-state and durable-state pieces. The stateless pieces are trivially easy to scale (with enough money), either by improving efficiency, using bigger machines or using more machines. Soft state (read caches, write-through caches, etc) is a little bit harder - but still won't need a huge amount of care.

The difficult piece is durable state storage (typically a database). When you are just starting out, a big database of everything is probably good enough. Choose a solid, widely used data store (MySQL, Postgres, Mongo, Oracle, MSSQL, etc) and use it. Do as little in this layer as you can - it's going to be the most expensive and difficult to scale. Put your business logic elsewhere. Protect it from read spikes with a cache. Design your schemas carefully.

Depending on your read/write mix, data volume and the requirements of your application, you can get pretty big (1000s of reads per second, 10s of writes) without any special database hardware or knowledge.
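The "protect it from read spikes with a cache" idea, sketched as a toy in-process read-through cache. In a real deployment this would be a shared cache like memcached rather than a per-process dict, and the `loader` callable is an assumption standing in for the real database query:

```python
import time

class ReadThroughCache:
    """Toy read-through cache: repeated reads of hot keys are served from
    memory, so only cache misses (and expired entries) reach the database."""

    def __init__(self, loader, ttl=60):
        self.loader = loader      # runs the real query on a miss
        self.ttl = ttl            # seconds before a cached value goes stale
        self._store = {}          # key -> (value, fetched_at)

    def get(self, key):
        hit = self._store.get(key)
        if hit is not None and time.time() - hit[1] < self.ttl:
            return hit[0]         # hit: the database is never touched
        value = self.loader(key)  # miss: one trip to the database
        self._store[key] = (value, time.time())
        return value
```

Even a short TTL helps enormously under a read spike: a key requested 1000 times per second costs the database roughly one query per TTL window instead of 1000 per second.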

Things to remember:

1) Avoid tight coupling between the stateless and durable-state parts of your application. There is nothing wrong with running your DB on the same box as your web server when you are small, but don't write code that assumes that architecture.

2) Choose your data model well. Think carefully about the Nouns in your business, and the relationships between those Nouns (much like you would in OO design). Choose the interfaces and relationships between nouns carefully to reduce coupling. Try to keep interfaces clean.

3) Measure, don't assume. Your page loads slowly? Don't throw out Apache and replace it with Nginx. Measure what is taking the time, and concentrate on the slow piece. If your actual web server is slowing you down, then change servers. Most often, though, slowness is going to be either in your database or your application code.

4) Worry about interfaces, objects and design. Don't worry about technology. The latest buzzwords will not save you from bad design practices - and bigger hardware will only save you for so long.

> The difficult piece is durable state storage (typically a database). When you are just starting out, a big database of everything is probably good enough. Choose a solid, widely used data store (MySQL, Postgres, Mongo, Oracle, MSSQL, etc) and use it. Do as little in this layer as you can - it's going to be the most expensive and difficult to scale. Put your business logic elsewhere. Protect it from read spikes with a cache. Design your schemas carefully.

I disagree with some of this. In general, the worst db scaling I have seen comes from systems issuing large numbers of simple queries, based on the idea of doing as little in the db as you can. Instead I would suggest two principles for making the db a little more scaling-friendly:

1) Everything that needs to be queried together should be queried together. Don't do lots of round trips and simple queries.

2) Don't do stuff in your database that it isn't designed to do. Write good queries, but don't do things like send emails from the db backend.

A corollary here is that you should write your queries with performance in mind but not do too much premature optimization. For example, it's a lot easier to go from a group by to a sparse index scan (using a stored proc or a CTE) than it is to lock yourself into a sparse index scan from the get go.
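Point 1 ("query together what belongs together") in a tiny sketch, using sqlite3 and a made-up `users` table. The anti-pattern issues one round trip per id; the batched version fetches everything with a single `IN (...)` query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "ann"), (2, "bob"), (3, "cat")])

wanted = [1, 3]

# Anti-pattern: one round trip (and one query parse/plan) per id.
slow = [conn.execute("SELECT name FROM users WHERE id = ?", (i,)).fetchone()[0]
        for i in wanted]

# Querying together: one trip fetches every needed row.
marks = ",".join("?" * len(wanted))
fast = [row[0] for row in conn.execute(
    "SELECT name FROM users WHERE id IN (%s) ORDER BY id" % marks, wanted)]

assert slow == fast == ["ann", "cat"]
```

With an in-memory database the difference is invisible; over a network, each extra round trip adds a full network latency, so N simple queries can be N times slower than one batched one.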

In general, though, the four points you mention are extremely well thought out. Following up on #4: although technology is sometimes important (especially newly maturing technologies like Postgres-xc), it is going to be far more useful where an app is well designed than where it is not. After all, if you don't know where the bottlenecks are in your app, you can't put effort into the right places to fix them.

> 4) Worry about interfaces, objects and design. Don't worry about technology. The latest buzzwords will not save you from bad design practices - and bigger hardware will only save you for so long.

Excellent advice; this is really what is critical in building a scalable application from a coding perspective.

Session locking (tying a user down to a particular instance in a cluster) can be used to mitigate the stateless issue. Basically, if it's data that you are OK with losing when one instance of a cluster goes down (such as a user's shopping cart, for example), then it should be fine to hold that state in-process, which could be worth it to avoid the extra overhead of persisting the data to the DB. State has its place in some situations and not others, so it all depends, but I'm not sure the idea of never having state is necessarily a good hard-and-fast rule.
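One simple way to sketch this kind of session pinning: hash the session id to pick an instance, so the same user always lands on the same machine (and its soft state). The instance names are hypothetical; note that plain modulo hashing reshuffles most sessions whenever the cluster changes size, which is why real setups use load-balancer sticky cookies or consistent hashing instead:

```python
import hashlib

def instance_for(session_id, instances):
    """Deterministically pin a session to one instance of the cluster,
    so its soft state (e.g. a shopping cart) stays on that machine."""
    digest = hashlib.md5(session_id.encode("utf-8")).hexdigest()
    return instances[int(digest, 16) % len(instances)]

# instance_for("sess-42", ["app1", "app2", "app3"]) always returns the
# same instance for the same session id.
```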

I have helped scale (non web) low latency systems in some pretty stressful situations.

In addition to the sage advice about monitoring / metrics, mentoring and not scaling until necessary I think the following are useful:

* Design Services not Software. In particular read "On Designing and Deploying Internet-Scale Services" (http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf) and at least the first chapter of "The Art of Unix Programming" (http://catb.org/~esr/writings/taoup/html/)

* Get to grips with debugging and profiling so you can figure out what's really happening. Tools like sar, sysstat, [d|k|s]trace, tcpdump, gdb etc and the equivalents for your datastore & application frameworks are invaluable and unfortunately for whatever reason you inevitably won't have all the metrics and monitors you need.

* Do try to understand every layer of your service. I have helped debug scale out related issues from Layer 2 to Layer 8. I have also had to debug many Layer 1 issues while bringing up a new Site or similar. I may not be a DBA, Network Engineer or Software Engineer but in the past I have had to wear those hats while scaling.

* Despite comments elsewhere about learning through mentoring and baptism-by-fire, there is a lot of real engineering and science theory you can lean on. Looking back on courses I took in school: while I didn't take any course on Scalable Web Programming, over the years I have used content from courses on Computer Architecture, Math (including Queuing Theory and Statistics), and Systems Programming (OS & Network).

I just found these slides from an MSc module at an Irish university on "Software in Production" which are well worth reading through if you're tasked with scaling systems and have a software development background:


I can think of two ways. One, work under someone with that experience, solving those sorts of problems. The other, which is far more thrilling/terrifying, is to run into scaling problems, and have to figure out how to fix them. There's no teacher like experience.

As others have mentioned, though, the first step is to monitor and log everything. You want metrics on what people are doing, what your server is doing, what your network stack is doing, and everything in between. The more data you have, the more tools you have to a) locate, and b) fix problems. You don't know if something has improved if you don't have a baseline to measure it against!

Some of the tools I use include Mixpanel/Google Analytics (for understanding user behavior), Newrelic (for server-side runtime metrics), MongoDB profiling and the MySQL slow query logs (for database sticky spots), and Munin (for server/network monitoring). You'll want to build your own toolset depending on the needs of your network, but once you've figured out how to collect the data, you're a lot further than you think.

One way to practice is to DDoS your own server until you figure out how to withstand it, then steadily increase the load. In fact, several services help you do this and provide some nice statistics too.

Be careful with load testing.

When your business/community/etc is 10x the size it is today, it's unlikely that usage patterns and use cases will be the same. Rather, they are likely to shift as you grow.

Simply firing 10x today's load at the system is one way to find out today's bottlenecks, but may be ineffective at finding tomorrow's bottlenecks. That doesn't mean it's not worth doing, but it needs to be done with care.

In my (limited) experience, it's the same way you become an expert in parenting -- do it, learn from everything that goes wrong, and incrementally do it better.

There are basic principles you can learn, but every service (child) is different, and you'll have to take that basic general knowledge and figure out how to apply it in your specific case.

Run into scaling problems.

Seriously though, here are the lecture notes for Stanford's CS193S Scalable Web Programming:


Build and install a scaled out application. Multiple AWS instances in a load balanced configuration. Use small instances to purposely bottleneck the available resource pool.

Use something like Blitz.io to beat the hell out of the application. This is your baseline.

Use performance monitoring and optimization tools to improve the architecture and application. Go through each layer of the application to see where the biggest gains can be found. Also look/test for data consistency.

Rinse and repeat with blitz.io.

Also, test how well the architecture responds to various types of server failures.

And make sure you test a viable backup process as well.

Lastly, once you have a nice, performant architecture, increase system resources both up (more resources per server) and out (more servers) to see how well the system actually scales. :)

    Build and install a scaled out application. Multiple AWS instances in a load balanced configuration. Use small instances to purposely bottleneck the available resource pool.
But where do you learn how to install a load balancer? I know I can google that particular example. But as with learning anything new, the big problem is when you don't know what you don't know.

There are different levels of scaling, and for different kinds of apps the trade-offs you can make are different. Unless you are making a really huge leap, most of the time it will be a gradual process. So unless you are talking Google or Twitter scale in a matter of months, it may not be as bad as it sounds.

Monitor, optimize your current configurations, cache, load balance, cluster, use messaging, use existing industry knowledge on distributed systems, read academic papers and innovate... For advanced distributed systems, check out Prof. Indranil Gupta's lectures among many others: http://www.cs.uiuc.edu/class/fa10/cs425/lectures.html

I find some of the most exciting tidbits about scalability come from reading the mailing lists of popular data stores that people are ramping up to huge deployments (MongoDB and Cassandra primarily, but Redis and CouchDB occasionally).

All those people with multi-TB systems scaled across many data centers inevitably end up on those lists asking great questions about exactly the types of problems you run into at that level.

It helps give you perspective before you are the one with a geographically dispersed collection of terabytes of data, which can be a godsend.

From personal experience: you don't gain expertise in scalability until it's too late, the website is down, and you are running around to get it back up!!!

When I first coded the website, I took some shortcuts to get the product out there before it was too late. Once we launched, the traffic started growing and I noticed the slow response times, so I went back to basics, went through all my pages and tried to optimize as much as I could. A lot of it I had to learn on the fly: everything from SQL query optimization, to playing around with database indexes to speed up the searches, to implementing memcached, and finally separating and load-balancing the database and web server. You can prepare yourself for many of these issues, but the best way is to fall right in the middle of it when your site goes down and you need to get it back up!!

I'm sure everyone who was in the same position at one point would agree that even though the site is down and you have to go through and optimize a lot of code, it's a GREAT FEELING knowing that people are actually visiting your site and using your creation!

Good stuff already. But you don't necessarily have to start by testing to destruction. Do some hammock design. Think about your tech stack. What do you think will be the limiting factors under typical user load? The next step is to simulate that load, and address any failures.

Go work at a company and team that does software at scale.

Design for 10x scale to start with.

Wait until your site goes down due to unanticipated bottlenecks.

Fix bottlenecks.


So essentially, you learn by failing? That used to get you killed in the caveman age, and these days, that tends to get you fired.

These are the sorts of things that you have to learn via second-hand experience, I reckon.

It's not quite that bad, really. Unless you are scaling amazingly fast, like say Twitter did, usually you get site slowness that you can troubleshoot and fix, not a completely busted website. My suggestion is to read some books/websites on the subject, and make sure you architect early to accommodate scaling later.

