
Ask HN: How did you learn about the design/architecture of backend systems? - yr
I'm interested in learning about the backend systems of Facebook, Amazon, Google, Twitter, Loopt or any other startups. I'm mainly interested in design/architecture and scaling.
======
thejo
For web applications, understanding the concept of state and managing it takes
you a long way towards building a scalable system (on a related note, any time
spent really understanding HTTP is time well spent). It's very useful to read
books, presentations and war stories, but for me, reading code in applications
designed for scale was far more important. I strongly recommend the Amazon
Dynamo paper. Read it and think about the design decisions the authors made.
Then go read the Cassandra code for an actual implementation (of the big
ideas, at least, and as the code stood back in 2008).

As with any learning, it's very hard to truly grok the importance of loose
coupling, service-oriented architecture, embedding statelessness in the
architecture, etc. unless you try to build something, get it wrong and then
fix it. So the very best way to learn is to put yourself in a place where you
get the opportunity to do that.

------
pkghost
My answer comes from the perspective of a recent grad who has spent a year at
a mid-sized SF social gaming company, working only recently on the back-end
(i.e., not a scaling expert).

A few things I've learned w/respect to scaling in my context:

\- I/O is likely to be your bottleneck, so design your db well and anticipate
splitting it across multiple machines (and what that means for your access
logic)

\- don't spawn a new thread for every request (a la apache)

\- keep your services simple (perhaps: one for handling web requests, one for
data access, one for caching, one for your payment system) and their
relationships even simpler (one hop max from RPC caller to callee)

\- cache like a hoarder

\- find/write an efficient serializer for RPCs between services
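
To make the first bullet concrete, here is a minimal sketch of hash-based shard routing in Python. The shard names and the `shard_for` / `group_by_shard` helpers are hypothetical, made up for illustration; real access logic would also have to handle cross-shard queries and re-sharding.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to database hosts.
SHARDS = ["db0", "db1", "db2", "db3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Pick a shard by hashing the key, so the same key always maps to the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def group_by_shard(keys, shards=SHARDS):
    """Group keys by shard so a multi-key read needs at most one query per shard."""
    grouped = {}
    for key in keys:
        grouped.setdefault(shard_for(key, shards), []).append(key)
    return grouped


if __name__ == "__main__":
    print(group_by_shard(["user:1", "user:2", "user:3", "user:4"]))
```

Modulo hashing like this remaps most keys whenever the number of shards changes; consistent hashing, one of the techniques described in the Dynamo paper, is the usual way to limit that.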

What helped me get a grip on this stuff was sitting down with an architect who
has done it successfully multiple times. I asked for an elevator-pitch
description of scaled web architectures, and then asked him about his
failures. Voilà: bullet points.

~~~
davidw
Apache does not spawn a new thread for every request.

~~~
jaydub
Every _request_ does not require a thread to be spawned (depends on
configuration/load), but every _connection_ would require its own thread.

"The worker MPM uses multiple child processes with many threads each. _Each
thread handles one connection_ at a time. Worker generally is a good choice
for high-traffic servers because it has a smaller memory footprint than the
prefork MPM.

The prefork MPM _uses multiple child processes with one thread each. Each
process handles one connection at a time._ On many systems, prefork is
comparable in speed to worker, but it uses more memory. Prefork's threadless
design has advantages over worker in some situations: it can be used with non-
thread-safe third-party modules, and it is easier to debug on platforms with
poor thread debugging support."

<http://httpd.apache.org/docs/2.0/misc/perf-tuning.html>
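
The difference between "a thread per request" and "a fixed pool of threads, each handling one connection at a time" can be sketched in a few lines of Python. This is only an illustration of the pooling idea, not how Apache is implemented; `handle_connection` and `serve` are hypothetical names, and the "connections" are plain integers rather than sockets.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_connection(conn_id: int) -> str:
    """Stand-in for serving one connection; returns the name of the thread that ran it."""
    return threading.current_thread().name


def serve(connection_ids, pool_size=4):
    """Serve every connection on a fixed-size pool instead of spawning a thread for each."""
    with ThreadPoolExecutor(max_workers=pool_size, thread_name_prefix="worker") as pool:
        return list(pool.map(handle_connection, connection_ids))


if __name__ == "__main__":
    names = serve(range(100), pool_size=4)
    print(len(names), "connections handled by", len(set(names)), "thread(s)")
```

However many connections arrive, at most `pool_size` threads are ever created; the extra connections wait their turn.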

~~~
pkghost
thanks for the explanation :)

------
burel
Here's an article about the various techniques used to make GitHub (a Rails app) fast:

<http://github.com/blog/530-how-we-made-github-fast>

Very interesting, but also full of intimidating stuff:

"[...] For requests to the main website, the load balancer ships your request
off to one of the four frontend machines. Each of these is an 8 core, 16GB RAM
bare metal server. Their names are fe1, …, fe4. Nginx accepts the connection
and sends it to a Unix domain socket upon which sixteen Unicorn worker
processes are selecting. One of these workers grabs the request and runs the
Rails code necessary to fulfill it. [...]"

I think they know what they are doing ...

------
mindcrime
I'm still not the world's foremost expert, but what I do know I've learned
through a combination of trial and error, reading books (I'll edit this later
and put in a couple of specific titles), reading stuff on the 'Net and classes
I took in school (I did a degree in "High Performance Computing" which had
some useful aspects to it).

A good place to start, if you're not already familiar with it, is High
Scalability: <http://highscalability.com/>

Edit: book recommendations:

Scalable Internet Architectures - [http://www.amazon.com/Scalable-Internet-Architectures-
Theo-S...](http://www.amazon.com/Scalable-Internet-Architectures-Theo-
Schlossnagle/dp/067232699X)

Linux Clustering - Building and Maintaining Linux Clusters -
[http://www.amazon.com/Linux-Clustering-Building-
Maintaining-...](http://www.amazon.com/Linux-Clustering-Building-Maintaining-
Clusters/dp/1578702747/ref=sr_1_1?ie=UTF8&s=books&qid=1279001787&sr=1-1)

High Performance Linux Clusters - [http://www.amazon.com/Performance-Clusters-
OpenMosix-Nutshel...](http://www.amazon.com/Performance-Clusters-OpenMosix-
Nutshell-
Handbooks/dp/0596005709/ref=sr_1_2?ie=UTF8&s=books&qid=1279001787&sr=1-2)

Linux Enterprise Cluster - [http://www.amazon.com/Linux-Enterprise-Cluster-
Available-Com...](http://www.amazon.com/Linux-Enterprise-Cluster-Available-
Commodity/dp/1593270364/ref=sr_1_3?ie=UTF8&s=books&qid=1279001787&sr=1-3)

Java Message Service - [http://www.amazon.com/Java-Message-Service-Mark-
Richards/dp/...](http://www.amazon.com/Java-Message-Service-Mark-
Richards/dp/0596522045/ref=sr_1_1?ie=UTF8&s=books&qid=1279001930&sr=1-1)

Java Message Service API Tutorial and Reference - [http://www.amazon.com/Java-
Message-Service-Tutorial-Referenc...](http://www.amazon.com/Java-Message-
Service-Tutorial-
Reference/dp/0201784726/ref=sr_1_4?ie=UTF8&s=books&qid=1279001995&sr=1-4)

Enterprise JMS Programming - [http://www.amazon.com/Enterprise-JMS-
Programming-Professiona...](http://www.amazon.com/Enterprise-JMS-Programming-
Professional-Mindware/dp/0764548972/ref=pd_cp_b_2)

Hadoop: The Definitive Guide - [http://www.amazon.com/Hadoop-Definitive-Guide-
Tom-White/dp/0...](http://www.amazon.com/Hadoop-Definitive-Guide-Tom-
White/dp/0596521979/ref=sr_1_1?ie=UTF8&s=books&qid=1279002071&sr=1-1)

Pro Hadoop - [http://www.amazon.com/Pro-Hadoop-Jason-
Venner/dp/1430219424/...](http://www.amazon.com/Pro-Hadoop-Jason-
Venner/dp/1430219424/ref=pd_sim_b_2)

Wikipedia:

<http://en.wikipedia.org/wiki/Shared_nothing_architecture>

[http://en.wikipedia.org/wiki/Shard_%28database_architecture%...](http://en.wikipedia.org/wiki/Shard_%28database_architecture%29)

<http://en.wikipedia.org/wiki/Publish/subscribe>

It's important to understand the difference between vertical scaling and
horizontal scaling. Horizontal is very en vogue these days, especially with
commodity hardware. Why? Because you can add power incrementally without
spending tons of money upfront, and without requiring a "forklift upgrade"
(that is a reference to needing a forklift to bring in a new mainframe or
minicomputer). This is a pretty good article on the topic:

[http://www.scalingout.com/2007/10/vertical-scaling-vs-
horizo...](http://www.scalingout.com/2007/10/vertical-scaling-vs-horizontal-
scaling.html)

As popular as horizontal scaling is, don't ignore the possibility of going
to bigger hardware, though. It has its own advantages, especially when you
start talking about the physical floor space needed to house servers.

Of course, "cloud computing" changes some of this, both by making it cheap and
easy to add VPSs to scale horizontally and by making it possible (sometimes)
to easily add more processing power, RAM, etc. to your "server." Read up on
Xen, KVM, EC2, etc. for more on that whole deal.

~~~
mindcrime
Forgot to include it originally, but it would probably be good to study
Map/Reduce as well:

<http://en.wikipedia.org/wiki/MapReduce>

<http://labs.google.com/papers/mapreduce.html>
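
For a feel of the programming model before reading the paper, here is a toy single-process word count in Python. The function names are my own, and everything that makes the real thing interesting (distribution across machines, fault tolerance, data locality) is left out; this only shows the map, shuffle and reduce steps.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

Because each `map_phase` call looks at one document and each `reduce_phase` call looks at one key, both steps can in principle be spread across many machines.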

Caching is huge too... I/O is expensive, RAM access is cheap. The more you can
pre-load, pre-calculate, and/or pre-sort stuff and store it in memory, the
better (in terms of avoiding expensive I/O, anyway). Caching has its own issues
though: if you cache so aggressively that you exhaust physical RAM and cause
more swapping, you can actually hurt yourself. Also, you have to deal with the
possibility of stale data in the cache, and decide when and how to expire
and reload items in the cache. But still, caching is essential; it's just not
necessarily _easy_.
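
Both concerns (bounding memory and expiring stale entries) show up even in a tiny in-process cache. The Python sketch below is one way to combine a size limit with a time-to-live; `TTLCache` and its parameters are hypothetical names, and the eviction policy (least recently used) is just one reasonable assumption.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Bounded cache: entries expire after ttl_seconds, and the least recently
    used entry is evicted once max_items is exceeded."""

    def __init__(self, max_items=1024, ttl_seconds=60.0, clock=time.monotonic):
        self._max_items = max_items
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, load):
        """Return the cached value for key, calling load(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._items.get(key)
        if entry is not None and entry[0] > now:
            self._items.move_to_end(key)  # mark as recently used
            return entry[1]
        value = load(key)  # the expensive I/O happens only here
        self._items[key] = (now + self._ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict the least recently used entry
        return value

    def __len__(self):
        return len(self._items)
```

The size limit keeps the cache from growing until the machine swaps, and the TTL caps how stale an entry can get. Choosing good values for both is the part that is not easy.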

Also, for perspective if nothing else, read the papers and stuff on SEDA
(Staged Event Driven Architecture). There's still debate about how effective
the SEDA approach is, but reading the discussion(s) will help you appreciate
the issues involved. <http://www.eecs.harvard.edu/~mdw/proj/seda/>
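
The core idea (break request handling into stages connected by queues) can be sketched in Python as below. This is a heavy simplification under my own assumptions: one thread per stage and fixed-size queues, with none of the per-stage thread pools or dynamic resource controllers the SEDA papers describe. `run_stage` and `run_pipeline` are hypothetical names.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def run_stage(handler, inbox, outbox):
    """One stage: pull events from inbox, process them, push results to outbox."""
    while True:
        event = inbox.get()
        if event is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(handler(event))


def run_pipeline(events, handlers, queue_size=8):
    """Connect the handlers with bounded queues and run each in its own thread."""
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(handlers) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(handler, queues[i], queues[i + 1]))
        for i, handler in enumerate(handlers)
    ]

    def feed():
        # Feed from a separate thread so a full first queue cannot block the drain loop.
        for event in events:
            queues[0].put(event)
        queues[0].put(STOP)

    threads.append(threading.Thread(target=feed))
    for thread in threads:
        thread.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

The bounded queues are the interesting part: when a downstream stage is slow, its queue fills up and the upstream stage blocks, so overload shows up as back-pressure rather than as unbounded memory growth.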

------
simonw
I found Cal Henderson's book "Building Scalable Websites" (which describes
pretty much everything he learnt while scaling Flickr) incredibly useful.

------
jarsj
Go work at Google. Even if you have to pay for it.

