Ask HN: Recommended books and papers on distributed systems?
302 points by letientai299 on Feb 1, 2021 | 63 comments
The most recent and complete book on distributed systems that I'm aware of is Designing Data-Intensive Applications (2017). I'm currently reading it. I also want to learn about other problems and ideas:

- Ideas that stood the test of time

- Ideas that were not feasible before but are now possible thanks to hardware improvements.

So, what are your recommendations for books and papers on these topics?

I have many recommendations of different kinds:

## Blogs:

- http://muratbuffalo.blogspot.com/

- https://bartoszsypytkowski.com/

- https://decentralizedthoughts.github.io/

- https://www.the-paper-trail.org/

- https://blog.acolyer.org/

- https://pathelland.substack.com/

## Other web resources

- https://aws.amazon.com/builders-library/ - set of resources from Amazon about building distributed systems

- https://www.youtube.com/playlist?list=PLeKd45zvjcDFUEv_ohr_H... - lecture series from Cambridge

## Books

- https://www.cl.cam.ac.uk/teaching/1213/PrincComm/mfcn.pdf - A great book on the maths of networking (probability, queuing theory etc...)

This is a very good list. Adrian's blog, in particular, is a treasure.

I wrote this post a while ago, aiming to answer a similar question about finding papers: http://brooker.co.za/blog/2020/05/25/reading.html One key point there is that there are, in my mind, multiple 'modes' of reading, and I like to use different approaches to finding material for different modes. Those blogs you list are great for curiosity mode. Another great resource there is Twitter: following distributed systems practitioners and researchers, and seeing what they tweet about. When I read a (recent) paper I really like, I often see if the authors are active on Twitter and follow them there if they are.

It's also important not to put too much weight on recency. A lot of the canon is actually more approachable than newer papers. For example, Lamport's classic "Time, Clocks" (https://www.microsoft.com/en-us/research/publication/time-cl...) and distributed snapshot (https://www.microsoft.com/en-us/research/publication/distrib...) papers, and Gilbert and Lynch's CAP paper (http://citeseerx.ist.psu.edu/viewdoc/download?doi=...) are approachable without deep background or systems knowledge. Similarly, John Little's "A Proof for the Queuing Formula: L = λW" (https://pubsonline.informs.org/doi/abs/10.1287/opre.9.3.383) is quite approachable if you have a math background but no systems knowledge, and is one of the foundational results behind the practice of building stable systems.
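For anyone who wants to see Little's law in action, here is a minimal sketch of the arithmetic. The function name and the example numbers (100 requests/s, 0.05 s mean time in the system) are illustrative assumptions, not taken from the paper.

```python
def mean_in_system(arrival_rate: float, mean_time_in_system: float) -> float:
    """Little's law for a stable system in steady state: L = lambda * W."""
    return arrival_rate * mean_time_in_system


# Hypothetical service: 100 requests/s arrive, each spends 0.05 s in the system.
in_flight = mean_in_system(100.0, 0.05)
print(f"average requests in flight: {in_flight:.1f}")
```

The same identity can be rearranged (W = L / λ) to estimate latency from measured concurrency and throughput.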

I've got some longer-form dives into researchers' work here: http://brooker.co.za/blog/2014/03/30/lamport-pub.html http://brooker.co.za/blog/2014/09/21/liskov-pub.html http://brooker.co.za/blog/2014/05/10/lynch-pub.html

Finally, books. There are a couple recommendations for Martin Kleppmann's Designing Data-Intensive Applications book, which I like a whole lot. Alex Petrov's "Database Internals" is also a very approachable introduction. I wish every practitioner in the field would read Harchol-Balter's Performance Modeling and Design of Computer Systems.

A gentler introduction to timeless concepts that have stood the test of time.

Material from a university master's course on distributed systems engineering. We developed our own course material in .ipynb format on GitHub. It should give you the basics in 5 days and applies distributed systems theory in 5 steps. I believe most improvements in the past decade are due not to hardware improvements but to advances in algorithms and tooling.

- Distributed systems. Overlays and communication network. Introduction to simulation framework https://github.com/grimadas/BlockchainEngineering/blob/maste...

- Gossip. Convergence of the transactions, information https://github.com/grimadas/BlockchainEngineering/blob/maste...

- Faults in distributed systems: crashes and disruptions https://github.com/grimadas/BlockchainEngineering/blob/maste...

- Malicious nodes, adversary model https://github.com/grimadas/BlockchainEngineering/blob/maste...

- Consensus and agreement despite malicious nodes https://github.com/grimadas/BlockchainEngineering/blob/maste...
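To give a feel for the gossip convergence topic in the list above, here is a small push-gossip simulation. It is a sketch under my own assumptions (synchronous rounds, uniformly random targets, a hypothetical `push_gossip_rounds` helper) and is not taken from the linked notebooks.

```python
import random


def push_gossip_rounds(n: int, seed: int = 0, max_rounds: int = 10_000) -> int:
    """Count rounds until all n nodes are informed (or max_rounds if the cap is hit)."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the rumor
    rounds = 0
    while len(informed) < n and rounds < max_rounds:
        # Every informed node pushes the rumor to one uniformly random node.
        targets = {rng.randrange(n) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds


print(push_gossip_rounds(64, seed=1))
```

Since the informed set can at most double per round, 64 nodes need at least 6 rounds; in practice the count grows roughly logarithmically with the number of nodes.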

I couldn't agree more about curiosity mode (I'm going to use that phrase liberally). Despite reading papers for many years, I rarely go in cold. I browse blog posts and twitter, ask myself a series of questions, then try to find/read papers to answer them. Of course this only leads to more questions, and so the journey continues.

I also agree with the recommendations for "Designing Data-Intensive Applications" and "Database Internals". Though, having read the latter for a book club at $employer, I felt it served better as a sort of "index for the space" for people who already had some DB experience, rather than a true introduction.

> Another great resource there is Twitter: following distributed systems practitioners and researchers, and seeing what they tweet about.

What's your twitter handle?

I recall paper trail had some errors in their post on paxos (iirc), which made me more skeptical of the publication in general. But these are good links!!

Well, "Paxos Made Simple" has some errors, so if you're skeptical about stuff with errors your options are rather limited.

See https://lamport.azurewebsites.net/pubs/pubs.html#paxos-simpl... Ok, might not be fair to call it an error, but it is super confusing, and a great example of Lamport's larger point that writing prose about these things is really hard, and we need better tools (like TLA+).

Interesting note: "one sentence in this paper is ambiguous [...] I am not going to remove this ambiguity or reveal where it is."

I find it surprising that he didn't actually attempt to change this problematic sentence if it was clearly shown that it leads to incorrect implementations. I get his point about prose, but a tiny footnote on a webpage doesn't feel like the best way to highlight what seems to be a common misunderstanding.

Any idea what this ambiguous sentence might be, and what is this mistake that implementers tend to make?

Time, Clocks, and the Ordering of Events in a Distributed System by Leslie Lamport is one of the best papers I've ever read. Not only has it clearly stood the test of time, but it sets the stage for deeper thinking on many of the issues endemic to distributed systems.

My former manager recommended it to me when I first started working in distributed systems and I found that it unlocked a huge variety of topics despite its simplicity. (Thanks Steve!)
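If it helps anyone, the logical clock rule from the paper fits in a few lines. This is a minimal sketch; the class and method names are my own, not the paper's.

```python
class LamportClock:
    """Logical clock: a counter that respects the happened-before order."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """On receipt, move past both the local clock and the message timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
b.tick()                    # unrelated local event on b
t_send = a.send()           # a stamps the outgoing message
t_recv = b.receive(t_send)  # b's clock jumps past the message timestamp
print(t_send, t_recv)
```

The receive timestamp is always larger than the send timestamp, which is the property the paper builds its total ordering on.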


For more, I recommend reading "Concurrency: The Works of Leslie Lamport".


All of the chapters are freely available (scroll down to "Book Downloads").

Any of Lamport's papers are classics and great reading for understanding distributed computing!

I thoroughly enjoyed "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" from Martin Kleppmann.

It's a great book that goes into pretty much all of the commonly used strategies for scaling data-intensive applications. It's not incredibly deep on any of them, but it will give you a great overview of the entire space. For each component, there are usually references to places where you can read and study more about it.

Martin also has a short course about distributed systems on YouTube:


And I found some MIT course as well on YouTube:


And direct links to all the references he mentioned in his book: https://github.com/ept/ddia-references

One of the best. I recommend it to people in any speciality of software engineering. (Although I believe the poster is currently reading this book)

This is a great book, indeed.

Also recommend

Joe Armstrong's (co-creator of Erlang) Dissertation is incredible.

Making reliable distributed systems in the presence of software errors


Jim Gray’s Tandem white paper makes for a nice complement to Joe’s.

“Why Do Computers Stop and What Can Be Done About It?”


I just followed the course on Distributed Systems by Maarten van Steen at the University of Twente. It was brilliant, and he has written a book on Distributed Systems. It is freely available via https://www.distributed-systems.net/index.php/books/ds3/

The class I took used 'Distributed Systems: Principles and Paradigms' which is an older book by the same authors and it was great.

Maarten is one of the best teachers, if not the best teacher I ever had. I was lucky enough to have him as my teacher for three major courses in my curriculum: Computer Networks, Operating Systems, and Distributed Systems.

I am surprised no one has recommended Leslie Lamport's writings [0].

His paper, "Time, Clocks, and the Ordering of Events in a Distributed System" is still considered a serious read after 40 years.


I can highly recommend Lindsey Kuper's 2020 introductory lectures on distsys: 26h of quality content!


Dijkstra's "Self-Stabilizing Systems in Spite of Distributed Control" is very "Dijkstra":

- Concise

- Approachable

- Entertaining

- Insightful

- Timeless


Enjoy :)
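For the curious, the K-state ring from the paper is short enough to simulate. This is a sketch under my own assumptions: a scheduler that picks one privileged machine at random per step, K chosen larger than the number of machines, and helper names (`privileged`, `step`, `stabilize`) that are mine.

```python
import random


def privileged(states):
    """Return the indices of machines that currently hold a privilege."""
    n = len(states)
    holders = []
    if states[0] == states[n - 1]:  # bottom machine compares with its left neighbour
        holders.append(0)
    for i in range(1, n):
        if states[i] != states[i - 1]:
            holders.append(i)
    return holders


def step(states, k, rng):
    """Let one privileged machine (picked at random) make its move."""
    i = rng.choice(privileged(states))  # at least one machine is always privileged
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def stabilize(states, k, steps=500, seed=0):
    """Run the ring for a fixed number of steps from an arbitrary start state."""
    rng = random.Random(seed)
    for _ in range(steps):
        step(states, k, rng)
    return states


ring = stabilize([3, 1, 4, 1, 0], k=6)
print(privileged(ring))
```

Starting from an arbitrary (here, deliberately messy) state, the ring ends up with a single privilege circulating, which is the self-stabilization property the paper is about.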

Awesome! What a great quick read. I've been wanting to start diving into the subject and this is a great intro into reading distributed systems papers.

> For brevity's sake most of the heuristics that led me to find them, together with the proofs that they satisfy the requirements, have been omitted and--to quote Douglas T. Ross's comment on an earlier draft, "the appreciation is left as an exercise for the reader."

this helps keep it concise ;)


Many seem to have had the same reaction as you! ;)

A Distributed Systems Reading List - https://dancres.github.io/Pages/

A small selection of papers that I find useful (also check the Wikipedia articles for a quick overview):

Communicating Sequential Processes "CSP" by Tony Hoare[0] has a strong influence on Go and Clojure. He also published/contributed to other interesting and influential books and papers.

Making reliable distributed systems in the presence of software errors by Joe Armstrong[1] (Erlang, BEAM). An implementation of the actor model and functional programming to optimize for reliability.

Conflict-free Replicated Data Types by Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski, "CRDTs" [2]. Enable strong eventual consistency, which is typically useful (and implemented) for databases, p2p (chat) applications and other distributed systems.

[0] https://www.cs.cmu.edu/~crary/819-f09/Hoare78.pdf

[1] https://www.cs.otago.ac.nz/coursework/cosc461/armstrong_thes...

[2] https://hal.inria.fr/hal-00932836/file/CRDTs_SSS-2011.pdf
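To make the CRDT idea concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest state-based CRDTs. The class and method names are my own; treat it as an illustration rather than a production design.

```python
class GCounter:
    """Grow-only counter: each replica only ever increments its own slot."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Element-wise max: commutative, associative, and idempotent."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment()
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```

Because merge is an element-wise max, replicas can exchange state in any order, any number of times, and still converge to the same value.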


I have a tangential question, in case anybody here has the answer. Have there been any attempts to prove (either formally or informally) that CSP either:

- Can model any computation?

- Is suitable for modelling business processes?

I've heard both of these claims quite often, but haven't been able to find any primary source on the matter.


Introduction to Reliable and Secure Distributed Programming (https://www.amazon.de/-/en/Christian-Cachin/dp/3642152597).

I took a class with Luis Rodrigues (one of the authors). The book introduces the fundamentals of distributed systems. For example, you would build leader election from first principles.

I've always been a fan of Distributed Systems for Fun and Profit: http://book.mixu.net/distsys/single-page.html

This was the first article that really made it all click for me

Distributed Systems 3rd edition (2017), you can get it free at: https://www.distributed-systems.net/index.php/books/ds3/

I used the 2nd edition of this book for a class at the University of Minnesota. I didn't like it at first, but eventually I got used to the style and now it's my go-to for reviewing distributed systems concepts. It's good at the introductory stuff, but I would like to see recommendations that would be good follow-ons from this.

This thread has piqued my interest. I am curious if the people interested in distributed systems are closer to hobbyists or professionals using the ideas directly in their work.

Distributed systems fans of HN, why are you reading about distributed systems?

I'm the OP. I'm working on unifying and improving various distributed lock implementations for my employer. The first ugly fact is that most of our current implementations are not safe[0], and the second ugly fact, AFAIK, is that the safe ones are slow. I'm trying to find something in between, or better.

[0]: https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...
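One idea discussed in the linked post is the fencing token: the lock service hands out a strictly increasing number with each grant, and the protected resource rejects writes carrying an older number. A toy sketch follows; `LockService` and `FencedStore` are hypothetical names, and lease expiry is only implied (the second `acquire` stands in for "the first lease timed out").

```python
class LockService:
    """Toy lock service: every grant comes with a strictly increasing token."""

    def __init__(self) -> None:
        self._next_token = 1

    def acquire(self) -> int:
        token = self._next_token
        self._next_token += 1
        return token


class FencedStore:
    """Storage that refuses writes carrying a token older than one it has seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale lock holder, reject
        self.highest_token = token
        self.value = value
        return True


locks, store = LockService(), FencedStore()
t1 = locks.acquire()  # client 1 gets the lock, then stalls (e.g. a long GC pause)
t2 = locks.acquire()  # lease expired; client 2 gets the lock with a newer token
ok_new = store.write(t2, "from client 2")
ok_stale = store.write(t1, "from client 1")  # client 1 wakes up too late
print(ok_new, ok_stale, store.value)
```

The check lives in the storage layer, so it still protects the data when a client wrongly believes it holds the lock.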

For the "ideas that stood the test of time" I'm keeping this list https://nvartolomei.com/dist-sys-classics/

Joe Armstrong, Erlang, software for a concurrent world. It passes the first criterion: time tested. Not so much the second of having been previously impossible, because Erlang could and does run on yesterday's hardware.

Erlang isn't theoretical. It's practical engineering. It works because message passing is what distributed systems have to do and at scale portions of a distributed system will become unavailable.

There are very specific problems that require more detailed engineering like Lamport Clocks and Raft Consensus Protocol. But not the general case. The general case is "being good enough" as is the nature of engineering.

I'm surprised the raft paper has not been mentioned and I'm happy to share it.

In Search of an Understandable Consensus Algorithm


I think raft has stood the test of time so far. A very popular implementation of raft is etcd, which is used as Kubernetes' backing store for all cluster data.[0]

[0] https://kubernetes.io/docs/concepts/overview/components/#etc...

Lamport's work linked earlier is great. I'd also point to Maurice Herlihy's (http://cs.brown.edu/~mph/) and Michel Raynal's (https://team.inria.fr/wide/team/michel-raynal/) work. They have both published well-written books on the topic, and their bibliographies also link to many relevant references from others.

Go back to the beginning!

Paul Baran's research on Distributed Communications that led to the Internet:

"Paul Baran and the Origins of the Internet" - https://www.rand.org/about/history/baran.html

"On Distributed Communications" - 1964 https://www.rand.org/pubs/research_memoranda/RM3767.html


imho the most useful book you can read.

Not a book or a paper, but MIT has published their recent Distributed Systems lectures[0] on YouTube. I haven't finished them yet, but they touch on papers that have stood the test of time.

[0] https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39...

Here's the Spring 2020 papers + schedule to accompany the videos/papers:


I would say papers + labs are way more important if you choose to take this course. Lectures are used to glue everything together, but the body of knowledge is in the papers.

Also agree. Currently following the syllabus to learn Go. Having a good time.

"Understanding Distributed Systems" tries to bring together theoretical aspects on the topic (like consensus and consistency models) with practical ones, such as resiliency mechanisms, asynchronous messaging, and observability.


I've learned distributed systems from Andrew S. Tanenbaum books [1]

- Distributed Operating Systems

- Distributed Systems: Principles and Paradigms

[1] https://en.wikipedia.org/wiki/Andrew_S._Tanenbaum#Books

I've just been re-reading this one last week, and I second this endorsement. It's an excellent entry-level book, and I think even the seasoned engineer will gain some perspective.

Shameless plug for a list I compiled that is specific to Paxos:


I have not checked the links therein for quite some time, but here’s a list of lists on the subject.


In addition to some of the other recommendations here, I really enjoyed "Building on Quicksand" by Pat Helland and Dave Campbell. A lot of eventual consistency made more sense to me after I read it.

Jeff Dean's The Tail at Scale is fantastic: https://research.google/pubs/pub40801/

In addition, learning how various internet protocols (esp. routing) were designed and how they evolved historically would be a good idea. Interconnections by Radia Perlman would be a start.

Not papers or books, but I'd recommend you look at the blog articles and such that Basho published about their design of Riak. In particular, they wrote a great article explaining vector clocks as opposed to other synchronizing mechanisms, and they wrote several good articles on CRDTs.

https://riak.com/category/technical/ - Riak blog

https://www.allthingsdistributed.com/files/amazon-dynamo-sos... - Dynamo paper that Riak was in part based on
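For readers new to vector clocks, here is a minimal sketch of the core operations. This is not Riak's implementation; the function names and the dict-based representation are my own assumptions.

```python
def vc_increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's entry advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def vc_merge(a: dict, b: dict) -> dict:
    """Element-wise max of two clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def vc_compare(a: dict, b: dict) -> str:
    """Classify a relative to b: 'equal', 'before', 'after', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


base = vc_increment({}, "x")     # first write, handled by node x
left = vc_increment(base, "y")   # update via node y
right = vc_increment(base, "z")  # independent update via node z
print(vc_compare(base, left), vc_compare(left, right))
merged = vc_merge(left, right)   # clock for a value that resolves the conflict
```

Unlike a single timestamp, the comparison can answer "concurrent", which is how a store can detect conflicting writes (siblings) instead of silently picking a winner.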

A previous thread on this topic from 2017:


I recently learned about Distributed Rate limiting. I do not have a definitive source for it but it's something you can explore.

The best and most accessible book on theory is probably Introduction to Reliable and Secure Distributed Programming by Cachin, Guerraoui, et al.

Release It! by Michael Nygard is a good book on resilience engineering for distributed systems.

Surprised no one has mentioned the google spanner paper:


I would be interested in a principles-driven approach with simple implementations in, say, Python, rather than Kubernetes this, containers that.
