
Distributed systems theory for the distributed systems engineer - LiveTheDream
http://the-paper-trail.org/blog/distributed-systems-theory-for-the-distributed-systems-engineer/
======
jandrewrogers
This is a pretty limited view of distributed systems engineering, particularly
at the high-end. Some topics that are essential for designing efficient,
massively distributed systems:

\- Greedy routing theory i.e. protocol design that is robust in a Nash
equilibria sense. It still amazes me how many distributed systems are built on
protocol designs that are provably unstable and inefficient.

\- Space decomposition data structures. The only class of data structure that
is efficiently distributable at massive scales due to their mathematical
relationship to space-filling curves. Hash tables are often inadvertently
implemented as space decomposition structures but there is a much richer and
more expressive universe of data structures that exist.

\- Game theoretic schedule awareness. This is a concept sometimes used in
massively parallel systems for HPC but distributed systems designers seem to
be unaware of it. It allows extremely high throughput by eliminating much of
the need for distributed coordination with respect to accessing shared
resources because every process can dynamically schedule its operations based
on its modeling of the decisions of other schedulers it interacts with such
that it almost never conflicts. I've never seen a paper on it but people have
designed systems based on it (hell, I've designed a number of massively
parallel systems based on it).

\- Practical topology. On the surface it is about understanding how to
maximize the throughput of a fabric of switch fabrics and most messaging
patterns used in distributed systems today are naive at an elementary level;
HPC has a much better understanding of packet flow optimization. There is also
the more theoretical algebraic topology that hints how you efficiently do
computational operations over space decomposition structures (mentioned above)
with minimal data motion. Algebraic topology is enormously relevant to
massively distributed and parallel databases but I rarely meet people who
understand it outside of Oracle and similar (not that most of them are using
it well).

A lot of distributed systems were engineered without any awareness of these
things. Most of their fundamental weaknesses follow from that. The kinds of
distributed systems that you can design if you really know the above topics
greatly exceed the capabilities of popular distributed systems used today.

To reduce it to a simple metric: if you can show me a competitive algorithm
design for the Graph500 benchmark, you actually understand massively scalable
distributed systems. The efficiency at scale demonstrated in that benchmark is
so far beyond common distributed systems because the designers of the top
entries actually understand the above points I raised.

~~~
gima
Instead of providing resources, continuation, or even references to these
topics for interested minds to dive into, I find bashing and gloating. Or am I
reading this completely wrong?

Don't get me wrong, it's good to point out the writing's shortcomings, but as
you appear to be knownledgeable in the field, please don't force others to
begin from the bottom. Share your knownledge.

~~~
Game_Ender
Took a little bit of reading but the OP is founder of company [0] making a
massively scalable geospatial database. This is the closest article [1] I can
find which has anything close to details. It lists the following technologies:

* A low level stack that bypasses the OS for at least disk IO (supposed to provide a 2-3x throughput increase)

* "polymorphic" space-filling curves for distributed parallel indexing

* hyper-dimensional spatial sieves [2]

* Allen’s Interval Algebra to parallelise SQL statements

More discussion on some of the OP's previous hacker news posts [3].

[0] - [http://spacecurve.com/](http://spacecurve.com/)

[1] - [http://www.it-director.com/blogs/Bloor_IM_Blog/2014/7/how-
do...](http://www.it-director.com/blogs/Bloor_IM_Blog/2014/7/how-do-you-map-a-
sphere-to-a-plane.html)

[2] -
[http://www.google.com/patents/US20090182837](http://www.google.com/patents/US20090182837)

[3] -
[https://news.ycombinator.com/item?id=7482151](https://news.ycombinator.com/item?id=7482151)

------
i0exception
I've found these video lectures by professor Seif Haridi to be really useful.
[http://www.ict.kth.se/courses/ID2203/video_lectures.html](http://www.ict.kth.se/courses/ID2203/video_lectures.html)

Edit :
[https://www.youtube.com/playlist?list=PL700757A5D4B3F368](https://www.youtube.com/playlist?list=PL700757A5D4B3F368)

~~~
shmerl
Just for the reference, with AdblockPlus enabled, the Youtube link images on
that page are hidden (at first I couldn't find any videos).

------
jaekwon
You must go beyond FLP impossibility. The problem with just stopping at
understanding FLP is that it assumes too little about what is available. No
assumptions of synchrony, no assumptions of clocks, then sure, consensus is
impossible in the presence of faults. The real interesting question is, what
realistic assumptions can I make to overcome FLP? I believe the CAP theorem is
similar.

Also, if you're interested in cryptocurrencies or distributed systems with
"greedy" participants, that's another class of problem that goes a level
beyond byzantine consensus. The difference is that you cannot assume that even
the "good" participants are running the officially sanctioned software, but
rather you must assume that there may be colluders who attempt to game the
network.

Byzantine generals problem + game theory -> ?

~~~
HenryR
Here's one paper on the intersection of Byzantine fault tolerance, altruiusm
and rational behaviour from UT Austin:
[https://www.cs.utexas.edu/lasr/download.php?uid=63](https://www.cs.utexas.edu/lasr/download.php?uid=63)

------
streametry
The article doesn't directly mention network partitions and split-brain
syndrome. Luckily there is Aphyr's Jepsen series [1] with both theory and
practice.

[1] [http://aphyr.com/tags/Jepsen](http://aphyr.com/tags/Jepsen)

------
krat0sprakhar
This is really awesome. I've been spending my nights on improving my
understanding and was wishing for a resource exactly like this. I've been
going through Prof Aspnes (of Yale) notes -
[http://www.cs.yale.edu/homes/aspnes/classes/469/notes-2011.p...](http://www.cs.yale.edu/homes/aspnes/classes/469/notes-2011.pdf)

A question to the HN folk - For a programmer who spends his day time writing
web applications and no systems experience, are there any good project ideas
you guys have to better understand Distributed systems? I was thinking of
implementing Paxos in Go / Scala but something more practical would be better.

~~~
owlish
I've been looking at MIT's Distributed Systems class which uses Go (posted
here about a year ago).

Original Link:
[http://css.csail.mit.edu/6.824/2014/labs/lab-1.html](http://css.csail.mit.edu/6.824/2014/labs/lab-1.html)

HN comments:
[https://news.ycombinator.com/item?id=5192650](https://news.ycombinator.com/item?id=5192650)

------
bochi
This article is the best introduction to the topic in my opinion:
[http://steve.vinoski.net/pdf/IC-Rediscovering-Distributed-
Sy...](http://steve.vinoski.net/pdf/IC-Rediscovering-Distributed-Systems.pdf)

Another good reference for understanding distributed databases is: [http://cs-
www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf](http://cs-
www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf)

------
blutoot
I would like to know if any site reliability engineer or sys-admin or someone
in any (dev)ops role ever seeks the help of distributed systems theory. I have
seen many of the real-life systems that are listed in the article thrown in as
a recommended reading for preparation for many job interviews. I've just never
managed to figure out in what situations or scenarios (especially
troubleshooting-related) the knowledge of the design or the design principles
behind a distributed system comes really handy for a person in any of the
roles I mentioned above. Any enlightenment will be greatly appreciated.

(EDIT: I guess what I'm trying to get at is that it would be nice to read a
detailed post about how someone who maintains a big distributed system handled
a major troubleshooting or scaling problem and how, if any, distributed
systems principle or theory came handy in that exercise.)

~~~
ajdecon
I don't have time to write a detailed post, but would a book recommendation
work? :)

"The Practice of Cloud System Administration: Designing and Operating Large
Distributed Systems", coming out next month, provides a good viewpoint on how
distributed systems design knowledge is useful to someone in a sysadmin role.
(I've read a preview via Safari Online and found it an excellent resource.)

[0] [http://www.amazon.com/The-Practice-Cloud-System-
Administrati...](http://www.amazon.com/The-Practice-Cloud-System-
Administration/dp/032194318X)

~~~
blutoot
Thanks! I bookmarked this one a few weeks ago and can't wait for it to come
out next month.

