
Learning to Build Distributed Systems - SirOibaf
http://brooker.co.za/blog/2019/04/03/learning.html
======
i0exception
Being on-call/carrying the pager for a complex and unstable distributed system
is a great way to understand how to build a good one. Building a distributed
system isn't hard per se. It's almost always the non-determinism and the
associated difficulty in debugging errors that's a problem.

~~~
gbuk2013
This has been my experience. The pieces are simple, but being able to
understand what is going on in the whole system is not. Detailed, quality
structured logging, sent to a central location that can be queried (e.g.
Elasticsearch / Kibana), is very important. I have come across several
experienced developers who really don’t like the idea of verbose logging, but
in my experience it has been critically important for dealing with a mesh of
microservices.
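
A minimal sketch of the kind of structured logging described above, assuming Python's standard `logging` module. The field names (`request_id`, `order_id`), the `checkout` service name, and the `context` key are hypothetical, and the in-memory stream only stands in for whatever ships lines to a central store such as Elasticsearch.

```python
import io
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} becomes an attribute
        # on the record; merge it in so it is queryable as top-level fields.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(service, stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Hypothetical usage: StringIO stands in for a log shipper.
stream = io.StringIO()
log = make_logger("checkout", stream)
log.info(
    "payment authorized",
    extra={"context": {"request_id": "req-123", "order_id": 42}},
)
line = stream.getvalue().strip()
print(line)
```

Carrying the same `request_id` through every service a request touches is what makes it possible to reassemble one request's path from the central store.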

~~~
alexbanks
"How does this even get invoked" and "Where would I go to find these logs" are
the two questions whose answers, if not obvious or obviously documented,
signal a big problem with that particular system.

My current company for whatever reason does not value logging at all, to the
point where for lots of our systems there just aren't logs/nobody has ever
bothered to look for them. It's pretty astonishing to me, a person who has
valued logging as one of the highest-value mechanisms to increase traceability.

~~~
gbuk2013
Prior to moving to software development, I worked 5 years in Support and
consequently I have a murderous hatred of anyone writing software that is hard
to debug and logging is a huge part of that.

I was actually greatly inspired by a product I used to support which had such
detailed logs that you could troubleshoot almost any issue from the (rather
huge) log dump. I try to do the same now.

~~~
yjftsjthsd-h
I really wish all developers had to do support at the beginning of their
careers so they understand how to build things that the ops team can work
with. Devops/SREs are one approach to this, but I'd really like to see it be
more widespread.

------
wippler
I found the "Designing Data-Intensive Applications" book to be a really helpful
introduction to learning more interesting details about distributed systems.
The book provides a gentle introduction to build intuition around these
systems and contains a plethora of links to go further down the rabbit hole.

[https://dataintensive.net/](https://dataintensive.net/)

~~~
barbecue_sauce
It's about 1000x better than "Designing Distributed Systems" which is
basically just a book about Kubernetes. (Should have known, since it's written
by Brendan Burns).

------
healsjnr1
Aside from theoretical advice, one of the key things I've found works best is
_not_ to set out to build a big complex system.

Solve small problems that deliver value in a distributed way and then iterate
and expand. Doing this focuses engineering on the real problems that need to
be solved at the scale you are working at. This delivers real understanding of
what is going on, both theoretically and specifically within your domain.

The retort to this is often that you'll never get to scale, or that your
architecture will be incoherent. Basically you need to see the whole picture
and build towards it.

The caveat is that solving small problems doesn't mean 'do something quickly
or hacky'. If you do, sure you'll end up with an incoherent mess.

If you maintain good engineering practices, and focus them on real problems,
my experience is you will scale your platform. These practices will keep you
focused on solving real problems properly, and doing so will give you an
understanding of what the next problem is going to be.

~~~
cbetti
The entry cost of introducing the first set of distributed applications is so
high you need backing.

------
gregorygoc
I was being bootstrapped into the new project (pretty big distributed system)
for 6 months since I went on call. But then my debugging skills skyrocketed,
and I know it’s a little inconvenient to wake up at 5 am because of pager duty
sad trombone sound, but I know it’s an investment on my part to become a
better distributed systems engineer.

------
ignoramous
Disclaimer: ex-AWS. Not for reproduction.

> There are a lot of distributed systems books I love, but I haven't found an
> accessible introduction I particularly like yet.

I'd highly recommend Martin Kleppmann's Designing Data-Intensive Applications
[0] as an introductory book on distributed systems.

> Here's Jaso Sorenson describing the design of DynamoDB

For the uninitiated, Jaso played a key role in S3's evolution and was one of
S3's founding engineers along with the then CTO and now Distinguished Eng,
Alan Vermeulen (who's probably the Jeff Dean of Amazon), who btw is an expert
speaker. Such a shame folks outside of Amazon don't get to see his numerous
talks. Absolute legends.

> Colm MacCarthaigh talking about some principles for building control planes.

Jaso also built "one of the most complex distributed systems ever built at
AWS", which later formed the inspiration for AWS HyperPlane and the
NetworkLoadBalancer by a team led by Colm, who in turn was a founding eng with
Route53 (first-ever 100% uptime public service at AWS?), VPC, and CloudFront
among other very fancy security things.

> I've (Marc Brooker) been doing this stuff for 15 years in one way or
> another, and still feel like I'm scratching the surface.

Well, he's being humble here. Marc's internal wiki pages on various designs
he's come up with over the years for AWS are absolute gold mines, in that they
explain his thought process, his experimentation, and his research with
existing publications. Along with Colm (and a few others), he is a prolific
speaker internally at AWS with some of his talks ranked in the top 10
consistently. His recent contributions with EC2/EBS and Lambda mean he's
worked with the largest and most complicated distributed systems there have
ever been at AWS (imo, of course).

There are others like Eric Brandwine (Security), James Hamilton (Data
Centers), Peter Vosshall (Silk, EC2, and Architecture), Becky Weiss (VPC and
Lambda), David Yancek (IoT, NoSQL), Andrew Certain (DB), Tim Rath (DB),
Stefano Stefani (DB, Warehouse, AI), Brad M (S3), Nafea B (Hardware), MSW
(EC2/OS), A Ligouri (EC2/OS), Hall Cary (Builder Tools), Marvin T (Kinesis) et
al who don't get a mention in the blog post but have been every bit gigantic
in their contributions at AWS since forever. The recent recruits at the top
rungs of eng at AWS have considerable pedigree coming in, as well. Exciting
times, for sure.

I wish they'd release those internal videos (and CoEs) on a case-by-case basis
to the public. That'd go a long way in contributing to the distributed systems
literature... apart from resuming writing blogs and papers about their systems
[1] like the NetEng/Route53 team once did.

[0] [https://dataintensive.net](https://dataintensive.net)

[1]
[https://news.ycombinator.com/item?id=19290069](https://news.ycombinator.com/item?id=19290069)

(Opinions my own. Facts presented may not be accurate. Zero intention to cause
hurt and anguish.)

~~~
mettamage
What do you think about Maarten van Steen's book?

[https://www.distributed-systems.net/index.php/books/distribu...](https://www.distributed-systems.net/index.php/books/distributed-systems-3rd-edition-2017/)

~~~
ignoramous
I'm sorry I haven't read that book, so can't comment. You could give M
Kleppmann's book a try if you're looking for an introduction to distributed
systems.

------
redact207
I've been building distributed systems, big ones, across banking and gaming
for many years now.

The only thing that makes it successful is correct data modeling and
application boundaries. Domain driven design is a great way to start and cqrs
a good next step.

Trying to make a poorly built application distributed is like trying to make a
poorly built class testable. You can kind of do it, but it's a lot of extra
messing around and it never really works properly.

------
gregoryexe
"If you can, carry a pager"

Why a pager? I wasn't even sure that they still made them.

~~~
closeparen
They do, but it’s just the industry metaphor for being on call. Usually
through smartphone apps like PagerDuty.

~~~
ignoramous
At AWS, it isn't uncommon to get _paged_ via email, SMS, phone-call, and a
smartphone app in addition to having an actual paging device (usually a dumb
phone) separate from one's personal phone or laptop to avoid cases where the
personal phone and email are both down (I believe this has happened, and not
just once) and that someone important is out of reach in a time of dire need.

Source: ex-AWS.

------
tabtab
Keep Conway's Law in mind. You usually can't solve organizational problems
with technology alone. If your staffing structure is not also "distributed",
then making systems that are could hurt you.

------
shinryuu
Does anyone have a good recommendation for a mooc on distributed systems?

~~~
bitL
Reliable Distributed Algorithms Part 1,2 on edX by KTH; Cloud computing
specialization from UIUC on Coursera.

------
maxtollenaar
Marc Brooker is a legend.

------
anth_anm
Another article about how hard it is doesn't mention the hardest part of all.

Getting to a place where you will ever get a chance to work on this stuff.

I tried joining a team working on it. Figured I wouldn't get to do a lot, but
it was a foot in the door sort of situation.

By the time I quit I had managed 7 months without doing a thing, and 3 more
where I wasn't even trying anymore.

Meanwhile the guys who built the thing all got pulled to work on the next big
thing. Maybe they were just amazing engineers.

Maybe getting hired out of college and being put right onto a massive
distributed systems project is the only fucking way to ever get your foot in
the door.

I'm not gonna let this article ruin my day. It's pointless. I can think about
how colossally fucked my career is on monday.

~~~
appwiz
OT but relevant to your comment. AWS is hiring for various service teams[1]. I
highly recommend it as the place to be if you want to work on massive
distributed systems. DM me on Twitter if any of those jobs appeals to you.

[1] [http://bit.ly/2L8cPHL](http://bit.ly/2L8cPHL)

------
unnouinceput
"List of articles to read to learn about distributed systems" - there, I fixed
the title for you

~~~
AlexCoventry
There's a lot more to it than that.

~~~
macintux
Agreed, but if someone wants to tackle such a list, here’s a meta-list of
distributed systems resources.

[https://gist.github.com/macintux/6227368](https://gist.github.com/macintux/6227368)

