
Ask HN: How to learn web scalability? - syberslidder
Hey fellow hackers,

I am a programmer, but most of my work doesn't involve the web. The most I
have done with web development is an EC2 instance running a LAMP stack. I am
working on a project right now, but my concern is that I am limited by my
one-server experience. I know the gist of scaling (traffic routing,
memcaching, sharding), but I don't really know how to go about setting it up.
How do you even predict whether you need to scale like that? Can scaling be
done in a modular fashion? I want to learn about web scale :) Any pointers in
the right direction would be greatly appreciated. If you could mention a
full-stack solution, and not just one or two technologies, that would be
great.

Thanks in advance!
======
patio11
The best way to learn web scalability is to work at companies which have
serious scaling problems. They tend to a) develop a lot of organic knowledge
about scaling, including by pushing the state of the art forward, and b)
almost definitionally have more money than God, and as a consequence can
spend copious money on hiring and training.

One thing companies with serious scaling problems will do is send you to
places like e.g. JavaOne (and many, many more specialized conferences
besides), where in the presentations you'll listen to e.g. LinkedIn talk about
how to make writes immediately consistent to the originating user and then fan
them out to other users over the course of the next few minutes, and what this
requirement does to a three-tier architecture.

There's also a lot online. High Scalability will get mentioned, and I tend to
keep my ears to the ground for interesting conferences and then see if they
post videos and/or slide decks (SlideShare is wonderful for these).

OK, that's the answer you want. Here's the answer you need: you are
overwhelmingly unlikely to have scaling problems. Single servers are big-n-
beefy things, even single VPSes. Depending on what you're doing with it, a
single EC2 instance running a non-optimized LAMP stack can easily get you to
hundreds of thousands of users served, and/or millions of dollars of revenue.
The questions of a) where do I get hundreds of thousands of users? or b) how
do I convince businesses to pay $100k a month for my software? are of much
more immediate and pressing concern to you than how you would take that very
successful business to 10x or 100x its hypothetical size.

(Much like "I'm concerned my experience with managing a household budget will
not prepare me for the challenges posed by a 9 figure bank account after I
exit. How will I ever cope? Where can I learn about the tax challenges and
portfolio allocation strategies of the newly wealthy?" the obvious answer
sounds like "You'll hire someone to do that for you. Now, get closer to
exiting than 'Having no product and no users' and worry about being a hundred-
millionaire some other day.")

~~~
bbq
>how to make writes immediately consistent to the originating user and then
fan them out to other users over the course of the next few minutes

How _is_ this done?

~~~
mechanical_fish
Well, the means I know of is affinity: You serve the user some kind of session
identifier, like a cookie, or special URLs that encode the session ID. Then,
when the user PUTs or POSTs something, you instruct the various layers of your
stack to read that session identifier and route based on it, directing that
user's requests to a specific machine for a while – the one that is most
likely to be up-to-date on what that user has recently done – instead of
randomly assigning each request to a different machine.

Then you put a timeout on the identifier so that the user eventually goes back
to using randomly-assigned machines. And perhaps you try to ensure that the
system doesn't break too badly if the user submits a write to a machine that
promptly goes offline for a while. And you pray that there are no second-order
effects. And, like everything else you do, this makes your caching strategy
more complicated: Can you afford to serve a given request from cache without
thinking, or do you need to waste time processing a session cookie, and then
perhaps waste _more_ time and RAM by having your back-end generate a page that
could just as well have been served from cache after all?
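
To make the affinity idea concrete, here is a minimal sketch in Python. The
backend list, the session-id handling, and the timeout are all hypothetical
placeholders; in practice you'd usually do this in the load balancer rather
than in application code.

    import random
    import time

    BACKENDS = ["app1.internal", "app2.internal", "app3.internal"]
    AFFINITY_TTL = 300  # seconds a user stays pinned after a write

    # session id -> (backend, expiry timestamp)
    affinity = {}

    def route(session_id, is_write, now=None):
        """Pick a backend for one request.

        Reads go to a random backend unless the user recently wrote, in
        which case they stay pinned to the backend that saw the write.
        """
        now = now if now is not None else time.time()
        pinned = affinity.get(session_id)
        if pinned and pinned[1] > now:
            backend = pinned[0]
        else:
            backend = random.choice(BACKENDS)
        if is_write:
            # Pin the user so their own subsequent reads look consistent.
            affinity[session_id] = (backend, now + AFFINITY_TTL)
        return backend

Once the expiry passes, the pin is simply ignored and the user falls back to
randomly-assigned machines, which is the timeout behavior described above.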

~~~
zxoq
That explains how you serve the same data to that user, but the more
interesting problem is: how do you propagate the changes to the rest of the
users over time?

Do you run some sort of backlog daemon that moves stuff between the different
shards? And how is that faster than doing so immediately (the same data would
be moved either way...)?

~~~
mechanical_fish
The generic answer must of course be: It depends on your architecture.

My answer assumed that we were talking about replication lag, where you are
copying data among the back-end data stores "immediately", which is to say
"quickly, hopefully on the order of milliseconds but you can't count on that".
Unfortunately, depending on the design of your database, replication is often
not "immediate" enough to guarantee consistent results to one user even under
the _best_ conditions, let alone when the replication slows down or breaks,
and that's why you might build something like the affinity system I just
outlined.

The actual mechanism of replication depends on the data store. MySQL
replication involves having a background daemon stream the binary log of
database writes to another daemon on a slave server which replays the changes
in order. (Of course, multiple clients can write to MySQL at a time, and
ordering all those writes into a consistent serial stream is MySQL's job.)
Other replication schemes involve having the client contact multiple servers
and write directly to each one, aborting with an error if a critical number of
those writes are not acknowledged as complete - and then there presumably
needs to be some kind of janitorial process that cleans up any inconsistencies
afterward. (I'm no CS genius; consult your local NoSQL guru for more
information. Needless to say, you should strive to buy this stuff wrapped up
in a box rather than build it yourself.)
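
For illustration only, here is a toy sketch of that write-to-multiple-servers
pattern in Python. The replica objects, their put() call, and the quorum size
are made-up placeholders; as noted above, you should buy this in a box rather
than build it.

    class QuorumError(Exception):
        pass

    def quorum_write(replicas, write_quorum, key, value):
        """Write to every replica; fail unless enough of them acknowledge."""
        acked = 0
        failed = []
        for replica in replicas:
            try:
                if replica.put(key, value):
                    acked += 1
                else:
                    failed.append(replica)
            except OSError:
                failed.append(replica)
        if acked < write_quorum:
            raise QuorumError(
                "only %d of %d replicas acknowledged" % (acked, len(replicas)))
        # Anything left in `failed` is now inconsistent; some janitorial
        # process has to reconcile those replicas later.
        return failed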

As for "shards", that's different. My understanding of the notion of a
"shard", as distinct from a "mirror" or a "slave", is that it's a mechanism
for partitioning writes. If you've built a system with N servers but every
write has to happen on each one, the throughput will be no higher than on a
system with one server and one write, and you're doing "sharding" wrong. What
you wanted was mirroring, and hopefully you weren't expecting that to help
scale your write traffic.
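
As a minimal sketch of what partitioning writes means, assuming each user's
data lives on exactly one of N databases (the shard list and the choice of
hash are illustrative, not a recommendation):

    import hashlib

    SHARDS = ["db0.internal", "db1.internal", "db2.internal", "db3.internal"]

    def shard_for(user_id):
        """All reads and writes for one user go to one shard, so total
        write throughput grows roughly with the number of shards."""
        digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]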

If data does need to be spread around from shard to shard, hopefully it has to
happen eventually rather than quickly, and some temporary inconsistencies from
shard to shard are not a big problem. So you can use something like a queueing
system: If you have some piece of data that needs to be mailed to N databases,
you put it on the queue, and later a worker comes along and takes it off the
queue and tries to write it to all N databases, and when it fails with one or
two of the databases it puts the task back on the queue to try again later.
(Or, perhaps, you put N messages on the queue, one per database, and run one
worker for each database.) You'll also need a procedure or tool to deal with
stale tasks, and overflowing queues, and the occasional permanent removal of
databases from the system.
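
A rough sketch of that queue-and-retry pattern, using Python's in-process
queue purely as a stand-in for a real message queue; the database objects and
their write() call are hypothetical:

    import queue

    tasks = queue.Queue()

    def enqueue_fanout(record, db_names):
        # One message per target database, as suggested above.
        for name in db_names:
            tasks.put((record, name, 0))

    def worker(databases, max_attempts=5):
        while True:
            record, name, attempts = tasks.get()
            try:
                databases[name].write(record)
            except OSError:
                if attempts + 1 < max_attempts:
                    # Put it back on the queue and try again later.
                    tasks.put((record, name, attempts + 1))
                else:
                    # Stale task: this is where the janitorial tooling
                    # mentioned above has to step in.
                    pass
            finally:
                tasks.task_done()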

This soliloquy is looking long enough, now, so it seems like a good time to
reiterate patio11's point: _Don't build more scaling than you need right now.
Hire someone to convince you not to design and build these things._

------
arkitaip
High Scalability is a great blog with lots of case studies and best practices.

<http://highscalability.com/start-here/>

~~~
DevAccount
Lots of interesting reading on this site.

------
relaunched
Scalability is like sex. You can spend years thinking about it, reading about
it, watching the videos and even practicing on your own...but nothing beats
actually doing it. And the more you work at scale, the better you get at
anticipating and troubleshooting the problems that come with increasing
demands.

------
kellanem
Slightly dated, but I recommend Cal's book on the lessons we learned scaling
Flickr: <http://www.amazon.com/Building-Scalable-Web-Sites-Applications/dp/0596102356>

Also remember that premature scaling is one of the leading causes of failure.

------
atsaloli
I'd like to recommend "Web Operations" by Allspaw:

<http://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440>

------
cagenut
Well, it's easier to say what not to do than what to do, so here's some of
that:

\- cargo-cult buzzphrase/keyword engineering: saying "we should use <thing>;
<cool_company_a> and <cool_company_b> use <thing>". If you find yourself
debating technology where entire paragraphs go by without real metrics, then
you're really social signaling, not engineering.

\- scaling for the sake of scaling: there is a vast grab bag of available
tools and techniques to scale, but for any given business the right answer is
to take a pass on most of it. The overwhelming majority will be either
unnecessary or flat-out counter-productive. The easiest way to separate what
to scale vs. what to hack/ignore is asking yourself, "Is this what my
users/customers love/pay me for, or something a grad student would love to
spend a semester on?"

\- timing context matters, a lot: what's "right" for scaling a young company
with very few engineers (and probably zero ops pros) will be a different
answer than what's "right" for scaling a millions-of-users/millions-in-revenue
company. What's right for Twitter is _wrong_ for a young company, almost by
definition.

\- "throw hardware at it" (the wrong way): throwing hardware at a problem is
_the best_ way to solve scaling problems, but only if you do it right.
stateless request-response across large pools of identical servers scales
better than anything else. however "give it a dedicated server" leaves you
with a mine field of one and two-off setups that only scales into the dozens-
of-servers before it starts to choke your business.

------
sk55
Udacity's Web Application Engineering course is pretty good.

<http://www.udacity.com/view#Course/cs253/CourseRev/apr2012/Unit/366003/Nugget/525001>

Particularly, Unit 6 and Unit 7 talk about scaling.

The course uses GAE, but the concepts apply everywhere. The professor is
Steve Huffman, who has practical experience scaling Reddit and now Hipmunk.

------
SeoxyS
Learn by doing. And don't worry too much about it until the need comes.
Luckily, the problem itself implies you're doing well, so the effort of
dealing with it when the time comes will pay for itself.

------
justin
Grow a site to tens of millions of users. Spend nights and weekends for
several years keeping it up.

:) Otherwise get a job at some company that is doing that presently.

------
eddie_the_head
buro9 answered this question very well, and I just link to his comment
whenever it comes up again: <http://news.ycombinator.com/item?id=2249789>

------
michaelochurch
I might be showing a lack of knowledge here, because I doubt I'm as well-
equipped to answer this as others on this thread, but here are my thoughts.

Learn functional programming, because it's going to become important when
parallelism becomes important. Learn enough about concurrency and parallelism
to have a good sense of the various approaches (threads, actors, software
transactional memory) and the tradeoffs. Learn what databases are and why
they're important. Learn about relational (SQL) databases and transactions and
ACID and what it means not to be ACID-compliant (and why that can be OK).
Learn a NoSQL database. I frankly don't much like most of them, but they solve
an important problem that relational databases need a lot more massaging to
attack.

All this said, I think focusing on "web scalability" is the wrong approach.
Focus on the practical half of computer science rather than "scalability". I
feel like "scaling" is, to a large degree, a business wet dream / anxious
nightmare associated with extremely public, breakout success (or catastrophic,
embarrassing failure) and that most people would do better just to learn the
fundamentals than to have an eye on "scaling" for its own sake.

Bigness in technology isn't innately good. All else being equal, it's very,
very bad. Sometimes the difficulty is intrinsic and that's a good thing,
because it means you're solving a hard problem, but difficulty for
difficulty's sake is a bad pursuit. Software is already hard; no point in
making it harder.

Finally, learn the Unix philosophy and small-program architecture. Learn how
to design code elegantly. Learn why object-oriented programming (as seen over
the past 25 years) is a massive wad of unnecessary complexity and a trillion-
dollar mistake. Then learn what OOP looks like when done right, because
sometimes it really delivers. For that matter, most scaling problems come from
object-oriented Big Code disasters that are supposed to perform well on
account of all the coupling, but end up being useless because no one can
understand them or how they work.

Learn the fundamentals so you _can_ scale, but don't scale prematurely for its
own sake.

I could be wrong, but I don't think I am.

