Learning to Build Distributed Systems (brooker.co.za)
389 points by SirOibaf on June 22, 2019 | 46 comments


Being on-call/carrying the pager for a complex and unstable distributed system is a great way to understand how to build a good one. Building a distributed system isn't hard per se. It's almost always the non-determinism, and the associated difficulty in debugging errors, that's the problem.


This has been my experience. The pieces are simple, but being able to understand what is going on in the whole system is not. Detailed, quality structured logging is very important, sent to a central location that can be queried (e.g. Elasticsearch / Kibana). I have come across several experienced developers who really don’t like the idea of verbose logging, but in my experience it has been critically important for dealing with a mesh of microservices.
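
For instance, a minimal sketch of what I mean by structured logging, using only the Python stdlib (the field names and the "orders" logger are made up for illustration):

    import json
    import logging
    import sys
    import time

    class JsonFormatter(logging.Formatter):
        # Render each record as one JSON line so a central store (e.g. Elasticsearch) can index it.
        def format(self, record):
            payload = {
                "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
            # Anything passed via extra={"fields": {...}} becomes a top-level, queryable key.
            payload.update(getattr(record, "fields", {}))
            return json.dumps(payload)

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("orders")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("order accepted", extra={"fields": {"order_id": "o-123", "request_id": "r-456"}})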


"How does this even get invoked" and "Where would I go to find these logs" are the two questions that, if the answers are not obvious/obviously documented, there is a big problem with that particular system.

My current company for whatever reason does not value logging at all, to the point where for lots of our systems there just aren't logs, or nobody has ever bothered to look for them. It's pretty astonishing to me, as a person who has valued logging as one of the highest-value mechanisms for increasing traceability.


Prior to moving to software development, I worked 5 years in Support, and consequently I have a murderous hatred of anyone writing software that is hard to debug; logging is a huge part of that.

I was actually greatly inspired by a product I used to support which had such detailed logs that you could troubleshoot almost any issue from the (rather huge) log dump. I try to do the same now.


I really wish all developers had to do support at the beginning of their careers so they understand how to build things that the ops team can work with. DevOps/SREs are one approach to this, but I'd really like to see it be more widespread.


I share your enthusiasm for good logs, so I'm curious: what made this product's logging great from a support perspective?


I wonder if they simply haven't seen a strong application of logging, and therefore don't know what they're missing.

Is there an opportunity in your daily work to sneak some good logging into a complex subsystem to show how valuable it can be?


For any new work that I do, logging will obviously be at the top of my priority list. Unfortunately, the systems that need the logging the most are the systems that I can only touch when they go down.


It's basically what happens when technically incompetent and inexperienced people wear management hats. If I was running a firm, the engineers would dictate to management, not the other way around.


I feel like this sentiment is easy to say, but basically impossible to implement.


Yep, you never know what logs you need until you need them. If cost becomes a factor (because proper telemetry will cost a noticeable amount of money), you can always choose to sample the more verbose logs.
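
For example, a rough sketch of sampling the verbose tier with a stdlib logging filter (the 10% rate and the class name are arbitrary, just for illustration):

    import logging
    import random

    class DebugSampler(logging.Filter):
        # Keep everything at INFO and above; keep only a fraction of DEBUG records.
        def __init__(self, keep_fraction=0.1):
            super().__init__()
            self.keep_fraction = keep_fraction

        def filter(self, record):
            if record.levelno > logging.DEBUG:
                return True
            return random.random() < self.keep_fraction

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    handler = logging.StreamHandler()
    handler.addFilter(DebugSampler(0.1))  # filter on the handler so it sees records from all loggers
    root.addHandler(handler)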


Honest Q: isn't that where tracers like AWS X-Ray and Google Dapper come in? And somewhere down the line, solutions like Envoy/Istio try to make the whole ordeal manageable?


You can implement distributed tracing with logs if every player in the system propagates the headers appropriately. Tools like New Relic APM and X-Ray will take care of that bit mostly for you.

Service mesh data planes have a wider scope, but they do make it much easier to implement distributed tracing (I think Istio includes this out of the box).

In the end though these tools can’t replace app logs because they only let you reconstruct what happens in between services, not internal to them.
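
To make the header-propagation bit concrete, it's roughly this (a sketch only; the X-Request-Id header name and the internal URL are placeholders rather than what X-Ray or New Relic actually use, and it assumes Flask and requests are available):

    import uuid
    import requests
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/checkout")
    def checkout():
        # Reuse the caller's trace id if one was sent, otherwise start a new trace.
        trace_id = request.headers.get("X-Request-Id", str(uuid.uuid4()))
        # Put the id in our own logs so lines from every service can be joined later.
        app.logger.info("checkout started trace_id=%s", trace_id)
        # Forward the same header downstream so the trace keeps its identity across hops.
        resp = requests.get("http://inventory.internal/reserve",
                            headers={"X-Request-Id": trace_id})
        return resp.text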


Nothing cloud is relevant when your production network has no outbound access to the internet due to PCI. ;)

Not sure about the other two (will check them out), but we have apps in 4 programming languages across several teams. Structured (JSON) logging is the easiest way of sending data to one place from literally everywhere.


Agree 100% - logging is expensive, in terms of performance hit and storage, but it is totally worth it.


In the medium to long term, the absence of logging is always more expensive than its presence.


> Building a distributed system isn't hard per-se.

I think this may be a bit overstated.

It's certainly true that most of the algorithms, etc. are -- if not necessarily simple -- at least understandable/understood generally. IME the problem is really all of the engineering around the algorithmic stuff.

One thing which I don't think is really well understood yet is different levels of Consistency. There are a lot of trade-offs to be made here, but generally I find that it's really hard to help people understand what those trade-offs are and how they could impact the UX and business.

(After that there are things like how to handle configuration changes safely, how to properly dispose of nodes, etc. etc.)
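
To make one of those trade-offs concrete: with quorum replication, the read and write quorum sizes trade latency and availability against stale reads, and the overlap rule is just R + W > N (a toy sketch, names made up):

    def read_sees_latest_write(n, w, r):
        # A read quorum is guaranteed to overlap the most recent write quorum when R + W > N,
        # so (with versioning) a read can return the latest acknowledged value.
        return r + w > n

    print(read_sees_latest_write(n=3, w=2, r=2))  # True: overlapping quorums, consistent reads
    print(read_sees_latest_write(n=3, w=1, r=1))  # False: reads may be stale (eventual consistency)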


What you're saying is true, but I think a big problem in general is that while bespoke problems often require bespoke solutions, there are a lot of common problems out there, and using a common solution goes a long way. Common solutions make the pitfalls and limitations obvious, help greatly with staff turnover and growing your team, and make things more intuitive and consistent.

I see a lot of systems that work pretty well when the author maintains it, but someone else jumps in and has no idea what server some part is running on, where the credentials might be, or where the log file is.


This is why I roll my eyes when I see resumes, especially from senior engineers who job hop every year or two. They may have never deployed their system, or, even if they did, never had enough support time to learn how supportable their system is, how it performs, how modular it is when requirements change, etc. The resume looks great, though.


I have to dispute this. A few months is sufficient to appreciate the stated concerns. A lot can happen in a few months. I am proof of the same. I think under two months is too short, however.


> Building a distributed system isn't hard per-se.

Well, we just have some ugly proofs that building a proper distributed system is impossible, leading to many trade-offs that annoy one customer or another at times... There aren't that many areas in computer science which are as difficult as building reliable, scalable distributed systems.


I think successfully doing this could be a good way, but I don’t think just carrying the pager would teach you much. I think unless you are able to understand and debug the system and handle alerts effectively, this experience would just leave you frustrated.


I found "Designing Data-Intensive Application" book to be really helpful introduction to learning more interesting details about distributed systems. The book provides a gentle introduction to build intuition around these systems and contains a plethora of links to go further down the rabbit hole.

https://dataintensive.net/


It's about 1000x better than "Designing Distributed Systems", which is basically just a book about Kubernetes. (Should have known, since it's written by Brendan Burns.)


Aside from theoretical advice, one of the key things I've found works best is _not_ to set out to build a big complex system.

Solve small problems that deliver value in a distributed way, and then iterate and expand. Doing this focuses engineering on the real problems that need to be solved at the scale you are working at. It delivers real understanding of what is going on, both theoretically and especially within your domain.

The retort to this is often that you'll never get to scale, or that your architecture will be incoherent: basically, that you need to see the whole picture and build towards it.

The caveat is that solving small problems doesn't mean 'do something quick and hacky'. If you do, sure, you'll end up with an incoherent mess.

If you maintain good engineering practices, and focus them on real problems, my experience is that you will scale your platform. These practices will keep you focused on solving real problems properly, and doing so will give you an understanding of what the next problem is going to be.


The entry cost of introducing the first set of distributed applications is so high you need backing.


I was being bootstrapped into the new project (a pretty big distributed system) for 6 months after I went on call. But then my debugging skills skyrocketed. I know it's a little inconvenient to wake up at 5 am to the PagerDuty sad-trombone sound, but I also know it's an investment on my part in becoming a better distributed systems engineer.


Disclaimer: ex-AWS. Not for reproduction.

> There are a lot of distributed systems books I love, but I haven't found an accessible introduction I particularly like yet.

I'd highly recommend Martin Kleppmann's Designing Data-Intensive Applications [0] as an introductory book on distributed systems.

> Here's Jaso Sorenson describing the design of DynamoDB

For the uninitiated, Jaso played a key role in S3's evolution and was one of S3's founding engineers, along with the then CTO and now Distinguished Engineer Alan Vermeulen (who is probably the Jeff Dean of Amazon), who btw is an expert speaker. Such a shame folks outside of Amazon don't get to see his numerous talks. Absolute legends.

> Colm MacCarthaigh talking about some principles for building control planes.

Jaso also built "one of the most complex distributed systems ever built at AWS", which later formed the inspiration for AWS HyperPlane and the Network Load Balancer built by a team led by Colm, who in turn was a founding engineer on Route53 (the first-ever 100%-uptime public service at AWS?), VPC, and CloudFront, among other very fancy security things.

> I've (Marc Brooker) been doing this stuff for 15 years in one way or another, and still feel like I'm scratching the surface.

Well, he's being humble here. Marc's internal wiki pages on the various designs he's come up with over the years at AWS are absolute gold mines, in that they explain his thought process, his experimentation, and his research into existing publications. Along with Colm (and a few others), he's a prolific speaker internally at AWS, with some of his talks consistently ranked in the top 10. His recent contributions to EC2/EBS and Lambda mean he's worked on the largest and most complicated distributed systems there have ever been at AWS (imo, of course).

There are others like Eric Brandwine (Security), James Hamilton (Data Centers), Peter Vosshall (Silk, EC2, and Architecture), Becky Weiss (VPC and Lambda), David Yancek (IoT, NoSQL), Andrew Certain (DB), Tim Rath (DB), Stefano Stefani (DB, Warehouse, AI), Brad M (S3), Nafea B (Hardware), MSW (EC2/OS), A Ligouri (EC2/OS), Hall Cary (Builder Tools), Marvin T (Kinesis) et al who don't get a mention in the blog post but have been every bit gigantic in their contributions at AWS since forever. The recent recruits at the top rungs of eng at AWS have considerable pedigree coming in, as well. Exciting times, for sure.

I wish they'd release those internal videos (and CoEs) to the public on a case-by-case basis. That'd go a long way toward contributing to the distributed systems literature... apart from resuming writing blogs and papers about their systems [1] like the NetEng/Route53 team once did.

[0] https://dataintensive.net

[1] https://news.ycombinator.com/item?id=19290069

(Opinions my own. Facts presented may not be accurate. Zero intention to cause hurt and anguish.)


Here are a couple of videos from re:Invent 2018:

Jaso talking about DynamoDB internals https://www.youtube.com/watch?v=yvBR71D0nAQ

Marc talking about Lambda internals https://www.youtube.com/watch?v=QdzV04T_kec


What do you think about Maarten van Steen's book?

https://www.distributed-systems.net/index.php/books/distribu...


I'm sorry, I haven't read that book, so I can't comment. You could give M. Kleppmann's book a try if you're looking for an introduction to distributed systems.


I've been building distributed systems, big ones, across banking and gaming for many years now.

The only thing that makes it successful is correct data modeling and application boundaries. Domain-driven design is a great way to start, and CQRS is a good next step.

Trying to make a poorly built application distributed is like trying to make a poorly built class testable. You can kind of do it, but it's a lot of extra messing around and it never really works properly.
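
For anyone who hasn't run into CQRS: the gist is that writes go through commands that enforce the domain rules, while reads go through a separate projection shaped for queries. A toy sketch (all names made up, not from any particular framework):

    from dataclasses import dataclass

    # Write side: commands mutate the domain and append events.
    @dataclass
    class PlaceOrder:
        order_id: str
        amount: float

    class OrderCommandHandler:
        def __init__(self, event_log):
            self.event_log = event_log

        def handle(self, cmd: PlaceOrder):
            if cmd.amount <= 0:
                raise ValueError("amount must be positive")  # domain rule lives on the write side
            self.event_log.append({"type": "OrderPlaced", "order_id": cmd.order_id, "amount": cmd.amount})

    # Read side: a projection built from the events, shaped for the queries you actually run.
    class OrderSummaryProjection:
        def __init__(self):
            self.totals = {}

        def apply(self, event):
            if event["type"] == "OrderPlaced":
                self.totals[event["order_id"]] = event["amount"]

    events = []
    OrderCommandHandler(events).handle(PlaceOrder("o-1", 42.0))
    view = OrderSummaryProjection()
    for e in events:
        view.apply(e)
    print(view.totals)  # {'o-1': 42.0}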


"If you can, carry a pager"

Why a pager? I wasn't even sure that they still made them.


They do, but it’s just the industry metaphor for being on call. Usually through smartphone apps like PagerDuty.


At AWS, it isn't uncommon to get paged via email, SMS, phone call, and a smartphone app, in addition to having an actual paging device (usually a dumb phone) separate from one's personal phone or laptop. That avoids cases where the personal phone and email are both down (I believe this has happened, and not just once) and someone important ends up out of reach in a time of dire need.

Source: ex-AWS.


I use a pager for on-call because the battery life and reception exceed those of cell phones. I have had coworkers who were on call at night whose phone decided to update and restart at 3am, resulting in their missing a page.


Indeed. Uptime and (maybe less relevant these days) better reception in hard-to-reach places.

Plus there are often (or at least used to be) regional paging companies; it’s nice to deal with a service provider who’s local and small enough to care about your business.

One unlikely downside: I recall a very localized outage years ago that apparently no one at the paging company noticed and there weren’t enough users in the area to complain before it bit us.


Keep Conway's Law in mind. You usually can't solve organizational problems with technology alone. If your staffing structure is not also "distributed", then making systems that are could hurt you.


Does anyone have a good recommendation for a mooc on distributed systems?


Reliable Distributed Algorithms, Parts 1 and 2, on edX by KTH; the Cloud Computing specialization from UIUC on Coursera.


Marc Brooker is a legend.


Another article about how hard it is that doesn't mention the hardest part of all.

Getting to a place where you will ever get a chance to work on this stuff.

I tried joining a team working on it. Figured I wouldn't get to do a lot, but it was a foot in the door sort of situation.

By the time I quit I had managed 7 months without doing a thing, and 3 more where I wasn't even trying anymore.

Meanwhile the guys who built the thing all got pulled to work on the next big thing. Maybe they were just amazing engineers.

Maybe getting hired out of college and put right onto a massive distributed systems project is the only fucking way to ever get your foot in the door.

I'm not gonna let this article ruin my day. It's pointless. I can think about how colossally fucked my career is on Monday.


OT but relevant to your comment. AWS is hiring for various service teams[1]. I highly recommend it as the place to be if you want to work on massive distributed systems. DM me on Twitter if any of those jobs appeals to you.

[1] http://bit.ly/2L8cPHL


"List of articles to read to learn about distributed systems" - there I fixed the title for you


There's a lot more to it than that.


Agreed, but if someone wants to tackle such a list, here’s a meta-list of distributed systems resources.

https://gist.github.com/macintux/6227368



