Ask HN: How to learn web scalability?
42 points by syberslidder on Aug 4, 2012 | 19 comments
Hey fellow hackers,

I am a programmer, but most of my work doesn't involve the web. The most I have done with web development is an EC2 instance running a LAMP stack. I am working on a project right now, but my concern is that I am limited by my one-server experience. I know the gist of scaling (traffic routing, memcaching, sharding), but I don't really know how to go about setting it up. How do you even predict if you need to scale like that? Can scaling be done in a modular fashion? I want to learn about web scale :) Any pointers in the right direction would be greatly appreciated. If you could mention the full stack solution and not just one or two technologies, that would be great.

Thanks in advance!




The best way to learn web scalability is to work at companies which have serious scaling problems. They tend to a) develop a lot of organic knowledge about scaling, including by pushing the state of the art forward, and b) almost definitionally have more money than God, and as a consequence can spend copious money on hiring and training.

One thing companies with serious scaling problems will do is send you to places like e.g. JavaOne (and many, many more specialized conferences besides), where in the presentations you'll listen to e.g. LinkedIn talk about how to make writes immediately consistent to the originating user and then fan them out to other users over the course of the next few minutes, and what this requirement does to a three-tier architecture.

There's also a lot online. High Scalability will get mentioned, and I tend to keep my ears to the ground for interesting conferences and then see if they post videos and/or slide decks (SlideShare is wonderful for these).

OK, that's the answer you want. Here's the answer you need: you are overwhelmingly unlikely to have scaling problems. Single servers are big-n-beefy things, even single VPSes. Depending on what you're doing with it, a single EC2 instance running a non-optimized LAMP stack can easily get you to hundreds of thousands of users served, and/or millions of dollars of revenue. The questions of a) where do I get hundreds of thousands of users? or b) how do I convince businesses to pay $100k a month for my software? are of much more immediate and pressing concern to you than how you would take that very successful business to 10x or 100x its hypothetical size.

(Much like "I'm concerned my experience with managing a household budget will not prepare me for the challenges posed by a 9 figure bank account after I exit. How will I ever cope? Where can I learn about the tax challenges and portfolio allocation strategies of the newly wealthy?" the obvious answer sounds like "You'll hire someone to do that for you. Now, get closer to exiting than 'Having no product and no users' and worry about being a hundred-millionaire some other day.")


I don't do web scalability (I'm in HPC), but being a sysadmin, doing more with less is the name of the game.

I think the absolute best answer is: think about it now, but worry about it when you get there.

That is to say, you should be conscious of how your decisions will affect you down the road, but you ultimately have to make a risk assessment of whether to pay now or pay later. How likely, really, are you to run into that problem later?

When you do run into scalability problems, you usually have a very clear bottleneck. If you don't, spend a few hours figuring out what the problem is. Fix that first. Fixing it will shift your bottleneck: CPU, network, disk. They just rotate around; as you fix one, the next becomes the issue.

You can't ever fix all your bottlenecks, you can just shift them around until it's "good enough." Fixing it more won't justify the cost, so you just deal with how it is, or you bring in a new solution that promises to solve your problems.

Put it this way -- if your first problem is how to optimize your quarter-million-dollar piece of equipment, you're probably in a pretty good spot. Otherwise, get there first.


>how to make writes immediately consistent to the originating user and then fan them out to other users over the course of the next few minutes

How is this done?


Well, the means I know of is affinity: You serve the user some kind of session identifier, like a cookie, or special URLs that encode the session ID. Then, when the user PUTs or POSTs something, you instruct the various layers of your stack to read that session identifier and route based on it, directing that user's requests to a specific machine for a while – the one that is most likely to be up-to-date on what that user has recently done – instead of randomly assigning each request to a different machine.

Then you put a timeout on the identifier so that the user eventually goes back to using randomly-assigned machines. And perhaps you try to ensure that the system doesn't break too badly if the user submits a write to a machine that promptly goes offline for a while. And you pray that there are no second-order effects. And, like everything else you do, this makes your caching strategy more complicated: Can you afford to serve a given request from cache without thinking, or do you need to waste time processing a session cookie, and then perhaps waste more time and RAM by having your back-end generate a page that could just as well have been served from cache after all?
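To make that concrete, here's a minimal sketch of what the routing decision might look like (the backend names, cookie handling, and TTL are invented for illustration; in practice you'd usually configure this in the load balancer as "sticky sessions" rather than write it yourself):

  import hashlib
  import time

  BACKENDS = ["app1.internal", "app2.internal", "app3.internal"]
  AFFINITY_TTL = 300  # seconds a recent writer stays pinned to one backend

  def pick_backend(session_id, last_write_ts):
      """Route recent writers to a stable backend; everyone else can go anywhere."""
      if session_id and time.time() - last_write_ts < AFFINITY_TTL:
          # Hash the session id onto one specific machine, so the user keeps
          # hitting the box most likely to have seen their recent writes.
          idx = int(hashlib.sha1(session_id.encode()).hexdigest(), 16) % len(BACKENDS)
          return BACKENDS[idx]
      # Affinity expired (or no session yet): any backend will do.
      return BACKENDS[int(time.time()) % len(BACKENDS)]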


That explains how you serve the same data to that user, but the more interesting problem is: how do you propagate the changes to the rest of the users over time?

Do you run some sort of backlog daemon that moves stuff between the different shards? And how is that faster than doing so immediately (the same data would be moved either way...)?


Scalability isn't about it being faster, just "able to be scaled". Sometimes it's about being able to make it faster by adding hardware and configuration.


The generic answer must of course be: It depends on your architecture.

My answer assumed that we were talking about replication lag, where you are copying data among the back-end data stores "immediately", which is to say "quickly, hopefully on the order of milliseconds but you can't count on that". Unfortunately, depending on the design of your database, replication is often not "immediate" enough to guarantee consistent results to one user even under the best conditions, let alone when the replication slows down or breaks, and that's why you might build something like the affinity system I just outlined.

The actual mechanism of replication depends on the data store. MySQL replication involves having a background daemon stream the binary log of database writes to another daemon on a slave server which replays the changes in order. (Of course, multiple clients can write to MySQL at a time, and ordering all those writes into a consistent serial stream is MySQL's job.) Other replication schemes involve having the client contact multiple servers and write directly to each one, aborting with an error if a critical number of those writes are not acknowledged as complete - and then there presumably needs to be some kind of janitorial process that cleans up any inconsistencies afterward. (I'm no CS genius; consult your local NoSQL guru for more information. Needless to say, you should strive to buy this stuff wrapped up in a box rather than build it yourself.)
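To make the second scheme concrete, here's a toy sketch of a quorum-style write (the replica interface is hypothetical, and real systems add hinted handoff, read repair, and so on):

  def quorum_write(replicas, key, value, required_acks=2):
      """Write to every replica; succeed only if enough of them acknowledge.

      `replicas` is a list of objects with a .put(key, value) method that
      returns True on success -- a stand-in for real client connections.
      """
      acks = 0
      missed = []
      for replica in replicas:
          try:
              if replica.put(key, value):
                  acks += 1
              else:
                  missed.append(replica)
          except ConnectionError:
              missed.append(replica)
      if acks < required_acks:
          raise RuntimeError("only %d of %d required acks" % (acks, required_acks))
      # Replicas in `missed` get repaired later by that janitorial process.
      return missed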

As for "shards", that's different. My understanding of the notion of a "shard", as distinct from a "mirror" or a "slave", is that it's a mechanism for partitioning writes. If you've built a system with N servers but every write has to happen on each one, your write throughput is no higher than that of a single server, and you're doing "sharding" wrong. What you wanted was mirroring, and hopefully you weren't expecting that to help scale your write traffic.
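In code terms the distinction is roughly this (modulo on a numeric user id is just one possible partitioning scheme, and `db.insert` stands in for whatever client library you actually use):

  SHARDS = ["db0.internal", "db1.internal", "db2.internal", "db3.internal"]

  def shard_for(user_id):
      # Sharding: each user's rows live on exactly ONE of the N databases,
      # so adding databases adds write capacity.
      return SHARDS[user_id % len(SHARDS)]

  def mirrored_write(mirrors, row):
      # Mirroring: every write goes to EVERY server. Great for read capacity
      # and redundancy; does nothing for write throughput.
      for db in mirrors:
          db.insert(row)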

If data does need to be spread around from shard to shard, hopefully it has to happen eventually rather than quickly, and some temporary inconsistencies from shard to shard are not a big problem. So you can use something like a queueing system: If you have some piece of data that needs to be mailed to N databases, you put it on the queue, and later a worker comes along and takes it off the queue and tries to write it to all N databases, and when it fails with one or two of the databases it puts the task back on the queue to try again later. (Or, perhaps, you put N messages on the queue, one per database, and run one worker for each database.) You'll also need a procedure or tool to deal with stale tasks, and overflowing queues, and the occasional permanent removal of databases from the system.
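A bare-bones sketch of that worker loop (the in-process queue stands in for something like RabbitMQ or SQS, and the database clients are placeholders):

  import queue

  def fanout_worker(task_queue):
      """Take (data, target_dbs) tasks off the queue, write to each target,
      and re-queue whatever failed so a later pass retries it. A real system
      would also cap retries and age out stale tasks."""
      while True:
          data, targets = task_queue.get()
          still_pending = [db for db in targets if not try_write(db, data)]
          if still_pending:
              task_queue.put((data, still_pending))
          task_queue.task_done()

  def try_write(db, data):
      try:
          db.write(data)  # placeholder client call
          return True
      except ConnectionError:
          return False

  # Producer side: one task per piece of data that needs to reach N databases.
  # task_queue = queue.Queue()
  # task_queue.put((row, [db_a, db_b, db_c]))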

This soliloquy is looking long enough, now, so it seems like a good time to reiterate patio11's point: Don't build more scaling than you need right now. Hire someone to convince you not to design and build these things.


Thanks, this makes sense. I've also heard host affinity referred to as stickiness before.


High Scalability is a great blog with lots of case studies and best practices.

http://highscalability.com/start-here/


Lots of interesting reading on this site.


Scalabilty is like sex. You can spend years thinking about it, reading about it, watching the videos and even practicing on your own...but nothing beats actually doing it. And the more you work at scale, the better you get at anticipating and troubleshooting the problems that come with increasing demands.


Slightly dated, but I recommend Cal's book on the lessons we learned scaling Flickr, http://www.amazon.com/Building-Scalable-Web-Sites-Applicatio...

Also remember that premature scaling is one of the leading causes of failure.


I'd like to recommend "Web Operations" by Allspaw

http://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/14...


Well, it's easier to say what not to do than what to do, so here's some of that:

- cargo-cult buzzphrase/keyword engineering: saying "we should use <thing>; <cool_company_a> and <cool_company_b> use <thing>". If you find yourself debating technology and entire paragraphs go by without real metrics, then you're really social-signaling, not engineering.

- scaling for the sake of scaling: there is a vast grab bag of available tools and techniques for scaling, but for any given business the right answer is to take a pass on most of them. The overwhelming majority will be either unnecessary or flat-out counter-productive. The easiest way to separate what to scale vs. what to hack/ignore is asking yourself "is this what my users/customers love/pay me for, or something a grad student would love to spend a semester on?"

- timing context matters, a lot: what's "right" for scaling a young company with very few engineers (and probably zero ops pros) will be a different answer than what's "right" for scaling a millions-of-users/millions-in-revenue company. What's right for Twitter is wrong for a young company, almost by definition.

- "throw hardware at it" (the wrong way): throwing hardware at a problem is the best way to solve scaling problems, but only if you do it right. Stateless request-response across large pools of identical servers scales better than anything else (see the sketch below). However, "give it a dedicated server" leaves you with a minefield of one- and two-off setups that only scales into the dozens of servers before it starts to choke your business.
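For example, a toy contrast (the shared store here is just a dict standing in for Redis/memcached/a database):

  # Stateful handler (pools badly): per-user state lives in this process's
  # memory, so every request from a given user must hit this same server.
  local_counts = {}

  def handle_stateful(session_id):
      local_counts[session_id] = local_counts.get(session_id, 0) + 1
      return "visit #%d" % local_counts[session_id]

  # Stateless handler (pools well): state lives in a shared store, so the
  # load balancer can send any request to any of N identical app servers
  # and you scale by adding boxes to the pool.
  def handle_stateless(shared_store, session_id):
      shared_store[session_id] = shared_store.get(session_id, 0) + 1
      return "visit #%d" % shared_store[session_id]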


Udacity's Web Application Engineering course is pretty good.

http://www.udacity.com/view#Course/cs253/CourseRev/apr2012/U...

Particularly, Unit 6 and Unit 7 talk about scaling.

The course uses GAE, but the concepts apply everywhere. The professor is Steve Huffman, who has practical experience scaling Reddit and now Hipmunk.


Learn by doing. And don't worry too much about it until the need comes. Luckily, the problem itself implies you're doing well, so the effort of dealing with it when the time comes will pay for itself.


Grow a site to tens of millions of users. Spend nights and weekends for several years keeping it up.

:) Otherwise get a job at some company that is doing that presently.


buro9 answered this question very well, and I just link to his comment whenever it comes up again: http://news.ycombinator.com/item?id=2249789


I might be showing a lack of knowledge here, because I doubt I'm as well-equipped to answer this as others on this thread, but here are my thoughts.

Learn functional programming, because it's going to become important when parallelism becomes important. Learn enough about concurrency and parallelism to have a good sense of the various approaches (threads, actors, software transactional memory) and the tradeoffs. Learn what databases are and why they're important. Learn about relational (SQL) databases and transactions and ACID and what it means not to be ACID-compliant (and why that can be OK). Learn a NoSQL database. I frankly don't much like most of them, but they solve an important problem that relational databases need a lot more massaging to attack.

All this said, I think focusing on "web scalability" is the wrong approach. Focus on the practical half of computer science rather than on "scalability". I feel like "scaling" is, to a large degree, a business wet dream / anxious nightmare associated with extremely public, breakout success (or catastrophic, embarrassing failure), and that most people would do better just to learn the fundamentals than to keep an eye on "scaling" for its own sake.

Bigness in technology isn't innately good. All else being equal, it's very, very bad. Sometimes the difficulty is intrinsic and that's a good thing, because it means you're solving a hard problem, but difficulty for difficulty's sake is a bad pursuit. Software is already hard; no point in making it harder.

Finally, learn the Unix philosophy and small-program architecture. Learn how to design code elegantly. Learn why object-oriented programming (as seen over the past 25 years) is a massive wad of unnecessary complexity and a trillion-dollar mistake. Then learn what OOP looks like when done right, because sometimes it really delivers. For that matter, most scaling problems come from object-oriented Big Code disasters that are supposed to perform well on account of all the coupling, but end up being useless because no one can understand them or how they work.

Learn the fundamentals so you can scale, but don't scale prematurely for its own sake.

I could be wrong, but I don't think I am.



