
Ask HN: Learning about distributed systems? - shahrk
I used to love Operating Systems during my undergrad; Modern Operating Systems by Tanenbaum is to date the only academic book I've read in its entirety. I recently read an article by Werner Vogels about how Amazon built Aurora, and I was captivated by it. I want to start reading about distributed systems. What would be a good starting point or road map?
======
otras
I posted this recently, but MIT's 6.824: Distributed Systems (taught by Robert
Morris, of both Morris worm and Viaweb/Y Combinator fame) is completely open
and available online. It includes video lectures, notes, readings, and
programming assignments from as recently as Spring 2020 (half of the lectures
were recorded from home as the pandemic struck). The assignments even include
auto-graded testing scripts, so you can verify your solutions.

[https://pdos.csail.mit.edu/6.824/](https://pdos.csail.mit.edu/6.824/)

~~~
solaxun
I posted this as well in the last thread about this class, but since we're
discussing it again: there is an active study group on Reddit doing the labs
in Clojure:

[https://www.reddit.com/r/mit6824clojure/](https://www.reddit.com/r/mit6824clojure/)

Several of the labs have been ported in full (MapReduce, first part of the
Raft lab), including test scripts. Please join if this is interesting to you;
the more the merrier!

------
weitzj
The book “Designing Data-Intensive Applications” by Martin Kleppmann is a
fantastic read with a very concise train of thought. It builds up from the
basics, adding one thing after another.

I kept asking myself what would happen if I were to extend the feature
presented in the chapter I was reading, only to find my answers in the next
chapter.

Brilliant book

~~~
danny_sf45
The problem with this book is what to read after it. The book is really good
but leaves you with the feeling "there are so many topics I don't know yet...
but all the other books out there suck". Any recommendations for someone who
has already read DDIA?

~~~
Jtsummers
[https://github.com/ept/ddia-references](https://github.com/ept/ddia-references)

Here is their git repository of all online references, by chapter. Seems it
doesn't include papers that don't have a clear, public link. If you have the
book, though, you can get the names and search for them.

------
DylanSp
* This System Design Primer [1] on GitHub is a decent overview of how large-scale apps are designed, with jumping-off points into many different subjects.

* The Morning Paper blog's distributed systems tag [2] has a lot of good summaries of research on distributed systems, both from academia and industry.

* I maintain a list of assorted resources on distributed system design and operations on GitHub. [3]

* Also, as mentioned, Designing Data-Intensive Applications is a good starting place.

[1] [https://github.com/donnemartin/system-design-primer](https://github.com/donnemartin/system-design-primer)

[2] [https://blog.acolyer.org/tag/distributed-systems/](https://blog.acolyer.org/tag/distributed-systems/)

[3] [https://github.com/DylanSp/distributed-systems-resources](https://github.com/DylanSp/distributed-systems-resources)

------
venkasub
I stumbled on this, which I felt was a very good compendium:
[https://dancres.github.io/Pages/](https://dancres.github.io/Pages/)

I would also recommend reading VLDB and database papers, as they show how
distributed algorithms are applied:
[http://www.vldb.org/pvldb/vol9.html](http://www.vldb.org/pvldb/vol9.html) and
[http://www.redbook.io/](http://www.redbook.io/)

Disclaimer: I used to work at Couchbase (a distributed NoSQL database) as a PM
and launched Eventing.

------
k00b
Speaking of Werner Vogels, have you seen this blog post on DistSys reading?
[https://www.allthingsdistributed.com/2012/12/paper-readings-...](https://www.allthingsdistributed.com/2012/12/paper-readings-2012.html)

------
dastbe
While I don't see it as a starting point (I think the topics require more
context), I'm a big fan of the articles Amazon has published recently as the
"Builder's Library"

[https://aws.amazon.com/builders-library/?cards-body.sort-by=...](https://aws.amazon.com/builders-library/?cards-body.sort-by=item.additionalFields.customSort&cards-body.sort-order=asc)

------
melvinroest
Maarten van Steen has got you covered. He worked with Andrew Tanenbaum on all
kinds of things back in the day :)

[https://www.distributed-systems.net/index.php/books/ds3/](https://www.distributed-systems.net/index.php/books/ds3/)

------
AdamM12
I found this to be a really good resource on high-level patterns to use:
[https://www.oreilly.com/library/view/designing-distributed-s...](https://www.oreilly.com/library/view/designing-distributed-systems/9781491983638/)

------
westurner
From a previous question re: "Ask HN: CS papers for software architecture and
design?"
([https://news.ycombinator.com/item?id=15778396](https://news.ycombinator.com/item?id=15778396))
and distributed systems we eventually realize were needed in the first place:

> _Bulk Synchronous Parallel:
> [https://en.wikipedia.org/wiki/Bulk_synchronous_parallel](https://en.wikipedia.org/wiki/Bulk_synchronous_parallel)._

Many/most (?) distributed systems can be described in terms of BSP primitives.
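
To make the BSP idea concrete, here is a minimal single-machine sketch of two supersteps (local compute, message exchange, barrier). It is only an illustration under loud assumptions: threads stand in for nodes, a shared list stands in for the network, and `bsp_sum` is a made-up name rather than anything from a real BSP library.

```python
import threading


def bsp_sum(values, workers=4):
    """Sum `values` using a toy two-superstep BSP program."""
    # Partition the input so each simulated node owns one chunk.
    chunks = [values[i::workers] for i in range(workers)]
    inbox = [0] * workers  # stand-in for messages sent to worker 0
    result = []
    barrier = threading.Barrier(workers)

    def worker(rank):
        # Superstep 1: local computation, then "send" the partial sum.
        inbox[rank] = sum(chunks[rank])
        # Barrier synchronization: no one proceeds until all have sent.
        barrier.wait()
        # Superstep 2: worker 0 combines the delivered messages.
        if rank == 0:
            result.append(sum(inbox))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(bsp_sum(list(range(101))))
```

The barrier is the important part: messages written in one superstep are only read in the next, which is what makes BSP programs easy to reason about.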

> _Paxos:
> [https://en.wikipedia.org/wiki/Paxos_(computer_science)](https://en.wikipedia.org/wiki/Paxos_\(computer_science\))._

> _Raft:
> [https://en.wikipedia.org/wiki/Raft_(computer_science)](https://en.wikipedia.org/wiki/Raft_\(computer_science\))
> #Safety_

> _CAP theorem:
> [https://en.wikipedia.org/wiki/CAP_theorem](https://en.wikipedia.org/wiki/CAP_theorem)._

Papers-we-love > Distributed Systems:
[https://github.com/papers-we-love/papers-we-love/tree/master...](https://github.com/papers-we-love/papers-we-love/tree/master/distributed_systems)

awesome-distributed-systems also has many links to theory:
[https://github.com/theanalyst/awesome-distributed-systems](https://github.com/theanalyst/awesome-distributed-systems)

Byzantine fault:
[https://en.wikipedia.org/wiki/Byzantine_fault](https://en.wikipedia.org/wiki/Byzantine_fault):

> _A [Byzantine fault] is a condition of a computer system, particularly
> distributed computing systems, where components may fail and there is
> imperfect information on whether a component has failed. The term takes its
> name from an allegory, the "Byzantine Generals Problem", developed to
> describe a situation in which, in order to avoid catastrophic failure of the
> system, the system's actors must agree on a concerted strategy, but some of
> these actors are unreliable._
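
As a rough illustration of what "some of these actors are unreliable" costs, the sketch below encodes the classic bound for the unsigned-message setting: agreement with f Byzantine actors needs at least 3f + 1 actors in total. The function names are made up for this example, and the f + 1 matching-reports rule is a simplification (it only guarantees that at least one honest actor reported the value), not a full agreement protocol.

```python
from collections import Counter


def max_byzantine_faults(n):
    """Largest f such that n >= 3f + 1 (classic bound, unsigned messages)."""
    if n < 1:
        raise ValueError("need at least one actor")
    return (n - 1) // 3


def agreed_value(reports, f):
    """Return a value backed by at least f + 1 matching reports, else None.

    With at most f faulty actors, f + 1 matching reports mean at least
    one honest actor reported that value.
    """
    if not reports:
        return None
    value, count = Counter(reports).most_common(1)[0]
    return value if count >= f + 1 else None


if __name__ == "__main__":
    n = 4
    f = max_byzantine_faults(n)
    print(f, agreed_value(["attack", "attack", "retreat", "attack"], f))
```

With four generals the system tolerates one traitor; with three it tolerates none, which is the well-known "three generals" impossibility.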

awesome-bigdata lists a number of tools:
[https://github.com/onurakpolat/awesome-bigdata](https://github.com/onurakpolat/awesome-bigdata)

Practically, dask.distributed (joblib -> SLURM), dask-ml, dask-labextension
(a JupyterLab extension for dask), and the Rapids.ai tools (e.g. cuDF) scale
from one to many nodes.

~~~
westurner
Not without a sense of irony, as the lists above include many papers that
could be readings with quizzes:

Distributed systems -> Distributed computing:
[https://en.wikipedia.org/wiki/Distributed_computing](https://en.wikipedia.org/wiki/Distributed_computing)

Category: Distributed computing:
[https://en.wikipedia.org/wiki/Category:Distributed_computing](https://en.wikipedia.org/wiki/Category:Distributed_computing)

Category:Distributed_computing_architecture :
[https://en.wikipedia.org/wiki/Category:Distributed_computing...](https://en.wikipedia.org/wiki/Category:Distributed_computing_architecture)

DLT: Distributed Ledger Technology:
[https://en.wikipedia.org/wiki/Distributed_ledger](https://en.wikipedia.org/wiki/Distributed_ledger)

Consensus (computer science):
[https://en.wikipedia.org/wiki/Consensus_(computer_science)](https://en.wikipedia.org/wiki/Consensus_\(computer_science\))

