
Notes on Distributed Systems for Young Bloods - jcdavis
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
======
btilly
Great article. There are just a few things that I would add to it.

1\. The metrics comment could not be more true. You cannot think too hard
about the actual problems you will encounter and what metrics you will need.
Have a common way to pull information about each service, and analyze it in
some common place.

2\. Standardize, standardize, standardize. When someone is answering a page at
2 AM and has to deal with a component of your system that they don't really
know, the more it resembles other components, the better. Think hard about how
you can get every service to be written in such a way that the same critical
information is available. Where is it documented? Who do I page? Where is the
monitoring? Yes, random third party components won't follow your standards,
and will be hard to integrate. But standardization is a good thing.

3\. You need to be able to send canary requests that trace through your whole
infrastructure. You should be able to flag a front-end request and have every
single request it generates through your entire system be logged somewhere,
so that you can see a breakdown of what happened. Hide this so that nobody
can use it to take your site down, but build the capacity for yourself.
(Standardization will make such a system much easier to build; see the sketch
after point 4.)

4\. Randomly canary a small fraction of your traffic. There is tremendous
value in having a random sample of traced traffic. When you're trying to
understand how things work, there is nothing like following an actual request
through a complex system and seeing what it did. Furthermore, if you've got
intermittent problems for a small fraction of users, being able to look at a
random slow request really, really, really helps you track down issues that
would otherwise be virtually impossible to replicate.
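
(A minimal sketch of points 3 and 4 in Python; the header name, sample rate,
and trust check are assumptions, not any particular system's API:)

    import random
    import uuid

    TRACE_HEADER = "X-Internal-Trace-Id"  # hypothetical header name
    SAMPLE_RATE = 0.001                   # trace ~0.1% of organic traffic

    def trace_decision(incoming_headers, is_trusted_caller):
        """Decide whether to trace this request, and under which id."""
        # Honor an explicit flag only from trusted callers, so outsiders
        # can't use tracing to flood the logging pipeline (point 3).
        if is_trusted_caller and TRACE_HEADER in incoming_headers:
            return incoming_headers[TRACE_HEADER]
        # Otherwise trace a small random sample of traffic (point 4).
        if random.random() < SAMPLE_RATE:
            return uuid.uuid4().hex
        return None

    def outgoing_headers(trace_id):
        """Headers to attach to every downstream call this request makes."""
        return {TRACE_HEADER: trace_id} if trace_id else {}

Every service that sees the header logs its work under that id and forwards
it downstream, so one flagged front-end request yields a full breakdown.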

------
chrislloyd
One of the most shameful moments of my career happened on the Saturday night
before WWDC in 2010. I'd never worked on a system with more than one component
or more than 20,000 lines of code. Drunk, I managed to stumble from Jeff's
hot-tub to his roof and told him that Twitter's scaling problems were "stupid"
and that "there were no good reasons why Twitter should go down".

I've apologized to him before, but after having worked on Minefold for the
last 2 years, I feel like I need to apologize again. The work that Jeff and
the others at Twitter have done has been amazing. I was also a massive cock.

~~~
jmhodges
This really did happen. And we're totally cool.

I refer to that as the Night of a Thousand Australians. Good times.

------
ef4
There's a growing area where I see a lot of people getting involved with
distributed systems who didn't have to deal with them before: rich in-browser
apps with their own permanent storage.

Once you have a stateful JavaScript application speaking over one or more
APIs to your backend services, you're in the domain of distributed system
design, especially if you use the application cache and support offline
operation. I frequently see people who are used to more traditional web
applications underestimate the system design challenges this creates.

(Those challenges are totally worth solving, because the model is very
powerful.)

~~~
chewxy
Curious question: how do you get people to even click "ALLOW" when the popup
says "do you want to allow this website to use localStorage"?

~~~
johanhil
What browser does this? I've never run into it.

~~~
anarchitect
It happens in mobile Safari IIRC.

------
dkhenry
Having done distributed systems for many years, the only thing I would add is
that not all systems are equal, so some of the things mentioned in this
article might not apply. In my experience, the real key to making a
distributed system work was to identify the critical paths, segment and
encapsulate the functionality of different components into the smallest
possible implementations, and then define robust messaging between the
components to allow for coordination. In my case we were building control
systems for Naval systems, so we had a much different set of requirements
than websites, but it was a massively distributed system (over 10K nodes)
with strong requirements for redundancy and latency (no Ethernet here, just
field-bus networks). I even wrote and published a paper on it some time back.
The point is, this is not the only path; there are other ways to implement
distributed systems, and there is still much to be discovered in the field.

~~~
Scramblejams
Link to your paper?

------
topfunky
Really great article...an instant classic.

 _Shameless plug:_ It's validating that we mentioned over half of these
points in an interview with Eric Lindvall, the startup founder behind the
Papertrailapp.com log management app: <https://peepcode.com/products/up-lindvall>

------
ChuckMcM
That is an awesome article. And if you're paying attention, it's a great way
to interview potential engineers who will be asked to build distributed
systems for your org. At Google, one of the engineers was asked whether the
system they were building was 'mostly reliable', and his answer was great. He
said, "No, I assume that every computer this runs on is trying to bring down
the service, and the service tries to dodge in such a way that it stays up
anyway."

------
FourthProtocol
The timing of this one is scary.

I'm working on a mobile app that does a key exchange with a server before
allowing a server-based registration or login. It's nowhere near as complex as
your average distributed system.

That said, I've run into a scary number of the things mentioned in this
article in my tiny little use case. Just trying to ensure a decent user
experience (timing out a comms check after two seconds rather than waiting up
to 60 seconds when the phone switches from networked to disconnected) in an
async message exchange needs some crazy orchestration. Keeping the code clean
means refactoring stuff I thought I had nailed two months ago.
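
(The platform APIs differ on mobile, but the shape of the fix is the same
everywhere; a Python sketch, with host and port as placeholders for whatever
your registration endpoint is:)

    import socket

    def comms_check(host, port, timeout_s=2.0):
        """Probe reachability in bounded time instead of the OS default.

        Without an explicit timeout, connecting from a device that just
        dropped off the network can block for a minute or more.
        """
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            return False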

I've been programming for a long time, but this stuff humbles me. And happily,
I love it.

~~~
pm90
Just a heads up: here's a course taught by Robert Morris that goes through a
lot of great stuff:

<http://pdos.csail.mit.edu/6.824/>

~~~
FourthProtocol
Going through it now!

------
jreichhold
Very glad that Jeff wrote this up; he deserves mad credit for documenting
something that has been frustrating me for years. People don't realize how
hard doing this right is, and they discount the hard work that goes into
making large systems scale in a stable manner.

It isn't just load testing; the whole system should be considered suspect. If
you don't act defensively at every step, you will be hosed by something you
thought would never happen. I just had a good talk about this last weekend.
Memory, TCP, and all the other rock-solid things can and will have issues in
large systems.

------
rdtsc
> Implement backpressure throughout your system.

This is perhaps a point one cannot arrive at theoretically, just by thinking
about it. The realization comes after observing the effects in practice. It
is interesting to watch distributed systems, especially ones that rely on
asynchronous messages, when things go wrong and queues start filling up. Or,
even weirder, machines start to synchronize for some reason, with queue sizes
oscillating in a resonance of sorts.

One way out is to reduce asynchronicity, but that comes with serious
performance penalties.
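
(A toy model of the blow-up, with made-up rates: if arrivals outpace service
even slightly, an unbounded queue grows forever instead of settling:)

    # Rates are assumed for illustration only.
    arrival_rate = 105  # jobs/sec arriving
    service_rate = 100  # jobs/sec the consumer can handle

    depth = 0
    for second in range(60):
        depth += arrival_rate - service_rate
    print(f"queue depth after one minute: {depth} jobs")  # 300, and climbing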

~~~
benjaminwootton
I view this the opposite way around. Message queues and asynchronicity are an
alternative to backpressure as described.

If I synchronously hit some service then I need to build facilities into the
service to say that it's too heavily loaded, and perhaps handle those
responses on the client.

If I pop a job onto a message queue or asynchronous endpoint, things will
just slow down for a while, which is normally as good a way as any of soaking
up load.

~~~
squarecog
How do you keep the message queue from unbounded growth?

~~~
jreichhold
Arrival rate and dequeue/processing time are the keys here. Avoiding
unbounded queues is easy: just limit the queue size. Of course, that isn't
what the user of the service wants, and it leads to a very bad experience.
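
(A minimal sketch of the bounded-queue form of backpressure in Python; the
size limit is arbitrary:)

    import queue

    jobs = queue.Queue(maxsize=1000)  # the bound is the whole trick

    def submit(job):
        """Enqueue if there's room; otherwise tell the caller we're overloaded."""
        try:
            jobs.put_nowait(job)
            return True
        except queue.Full:
            # Shed load explicitly (e.g. return HTTP 503) rather than
            # buffering forever while latency grows without bound.
            return False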

------
berlinbrown
Does distributed always mean a complex multi-tier web application?

~~~
epenn
No. That's one example, depending on how the website is set up, but
distributed systems aren't limited to that. A distributed system is one that
requires coordination among multiple machines to execute a task or set of
tasks. A few examples:

\- A website using a load balancer to spread service across multiple machines

\- A Hadoop cluster running a series of MapReduce tasks

\- A botnet set up to DDoS some target

------
biscarch
Great article. I've just recently started exploring distributed systems
(specifically with Riak Core/Erlang) and, while reading, I hit upon a
solution (I think) to a problem I've been working on.

------
saosebastiao
It was a great article, but this sentence had me a bit confused: "Well, a
typical machine at the end of 2012 has 24 GB of memory, you’ll need an
overhead of 4-5 GB for the OS, another couple, at least, to handle requests,
and a tweet id is 8 bytes."

My desktop has like 40 Chrome processes, is running GeoServer on Tomcat
(currently idle, but whatever) and the entire Unity desktop environment, and
I'm currently at 2.7 GB. What kind of server OS needs 4-5 GB?

~~~
peterwwillis
One that has buffers.

~~~
saosebastiao
Don't they all have buffers?

~~~
peterwwillis
Yes. Depending on load and services, servers require a great deal more in
buffers and cache than your desktop does. Figure out the memory allocation
for a single TCP connection on a default Linux 3.0 kernel and multiply it by
(at least) 10k. Considering you probably want greatly expanded limits and
bulk transfer, you might want to multiply that by another 10 or 100. And
that's just TCP buffers for active sessions.
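
(Rough back-of-the-envelope in Python, using the stock Linux defaults of that
era, roughly 85 KB receive and 16 KB send per socket; exact values vary by
kernel and are autotuned under load:)

    recv_default = 87380  # net.ipv4.tcp_rmem default, bytes
    send_default = 16384  # net.ipv4.tcp_wmem default, bytes
    connections = 10000

    print((recv_default + send_default) * connections / 2**30)  # ~0.97 GiB

    # Tuned up to 1 MiB each way for bulk transfer, the same
    # connection count wants roughly twenty times as much:
    print(2 * 2**20 * connections / 2**30)  # ~19.5 GiB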

~~~
saosebastiao
Thanks for the explanation

------
andyzweb
there are only 2 problems in computer science.

0\. naming things

1\. cache invalidation

2\. off-by-one errors

~~~
drudru11
this is the truth

------
pm90
I initially replied to a comment, but I'll post this as a direct reply as
well: Robert Morris teaches a course on Distributed Systems that covers a lot
of great stuff, much of which is available online:

<http://pdos.csail.mit.edu/6.824/>

------
newobj
Great article; concise explanation of most of the points I would have expected
to be there. Especially some of the more subtle things like backpressure
everywhere, measure with percentiles, and the importance of being partially
available.

------
pm90
From the article: _When a system is not able to handle the failures of
another, it tends to emit failures to another system that depends on it_

------
ybaumes
Great article. I would also like to mention one more point: implement a
"chaos monkey". As the Netflix team nails it, "We have found that the best
defense against major unexpected failures is to fail often." [1]

[1] <http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html>
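
(The core idea fits in a few lines; the fleet names and kill probability
below are placeholders, and the real Chaos Monkey does far more:)

    import random

    fleet = ["app-01", "app-02", "app-03"]  # hypothetical instances

    def unleash(probability=0.1):
        """Occasionally kill a random instance so recovery paths stay exercised."""
        if random.random() < probability:
            victim = random.choice(fleet)
            print(f"chaos: terminating {victim}")
            # here you'd call your provider's terminate-instance API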

------
thomseddon
_People are less finicky than computers, even if their interface is a little
less standardized._

Brilliant, made me smile :)

------
lemming
_Metrics are the only way to get your job done_

God, I wish I'd known this sooner.

------
toolslive
Great read. You can feel the pain he must have suffered.

------
lallysingh
Man, good load testing never gets any respect.

------
martinced
_"If you can fit your problem in memory, it’s probably trivial."_

Intellectual honesty at its best...

