How do you reason about a probabilistic distributed system?

hinkley · on Sept 11, 2020

Statistics was one of the first classes I really struggled with in school.

When something grabs your attention, you start to suffer from Frequency Bias (like 'all of a sudden' everyone has the same car you bought or are going to buy). But in this case, everyone really does suck at statistics. Which probably explains why when I learned what an actuary was and that they made about the same as an anesthesiologist, I wasn't that surprised.

The ironic part for me is that I kept thinking dependent variables were independent, whereas most of the people I observe think independent variables are dependent (gambler's fallacy, etc), and so vastly underestimate the potential for a 'bad run' of events.

When I still played WoW they posted an article about how they tweaked probabilities a bit to avoid frustrating people. Much of the questing in WoW is based on succeeding at an action a set number of times that has a probability of failure, which turns it into a normal distribution. Your buddy might have gotten lucky and finished a quest in two minutes. Meanwhile, there's a 1 in a million chance that you could still be working on that quest 5 hours later. And there are 10 million users...

So to prevent rage quitting, they made the success rate dependent. Which they can do because the scenario is fictitious. You can't do the same thing in the real world.

I'm not sure if there's a way through this or around it (except maybe "always hire a professional"). More likely in 10 years we will discuss how we should have just backed away slowly from these probabilistic situations and done something else.

Swizec · on Sept 11, 2020

My favorite version of this is that shuffle mode is really hard to implement for music . You can’t use random sort because it leads to too many repetitions of songs close together. When that happens people complain that your shuffle isn’t random enough.

colinmhayes · on Sept 12, 2020

This, but spotify's shuffle is actually horrendous. It only plays songs within something like 50 of the one it's on. If you have a multi-hundred playlist you'll only hear the same 50 songs if you put it on shuffle and click the first song until it's finished with all 50 of those songs. Then it'll go to the next 50.

tylerk · on Sept 12, 2020

I am amazed that it has remained so bad for so long! I find it very frustrating and have resorted to copying playlists into a text line shuffler and re-adding the songs...

hinkley · on Sept 12, 2020

Random and shuffle are two very different actions (pick a card vs take the top card).

I mean it’s even in the name. Shuffle.

mjb · on Sept 11, 2020

Very cool post. I like this point particularly:

> In this case we can’t assume, in general, that the thread scheduler will assign each thread processor time with uniform probability. That assumption makes even less sense in a fully distributed system with processes running on separate computers connected by a network.

I would love to see tools in this space mature some more. TLA+ has had a huge influence on how I approach distributed and concurrency problems, and a large influence on the industry in general. Modelling non-determinism (beyond the 'try all paths' approach), and assigning probabilities to outcomes, seems like the large missing piece needed to model real-world systems and protocols.

In systems with limited redundancy, for example, there's always a "transition to a failure state" path which tends to get ignored in modelling with tools like TLA+. That's because these tools can tell you that the whole thing might fail (which is useful), but not the probability that the whole thing would fail (which is significantly more useful).

brzozowski · on Sept 12, 2020

Anyone with a software engineering background interested probabilistic model checking and learning automata should check out the work from ICSAS led by Prof. Lijun Zhang, Yong Li and Andrea Turrini. In particular their work on ePMC and ROLL are really excellent:

https://github.com/ISCAS-PMC/ePMC

https://github.com/ISCAS-PMC/roll-library

Storm is another probabilistic model checker from RWTH Aachen, developed Christian Hensel, Sebastian Junges, Tim Quatmann, Matthias Volk et al. at , which has a nice Python API:

https://www.stormchecker.org

hwayne · on Sept 11, 2020

One thing I've started doing with my PRISM models that helps a lot is using formulas as a poor man's strings. Like instead of

     // Not-yet-flipped: 0, heads: 1, tails: 2

I'd do

    formula unflipped = 0;
    formula heads = 1;
    formula tails = 2;

Another thing that helps a lot is writing templates and then writing scripts that write out the spec for you. I'm doing this with a queueing problem and it's godsent.

hinkley · on Sept 11, 2020

One of the interesting things about Raft that doesn't get focused on much is that strictly speaking Raft does not require that all nodes be able to see each other, just that some nodes can see all of the other nodes.

One of the reasons we use consensus protocols is because with enough hardware involved, the machines cannot even agree about which machines are up and responding to traffic.

In Raft, a leader has to be able to communicate with more than half of the cluster. But the followers don't strictly have to be able to see each other (and if I understand the election protocol correctly, not even during election). The leader is the hub of a wheel, not a member of a star-connected graph, and as long as the 'spokes' are all working, there is no partition. Even if individual links keep disappearing.

Is that true of the snowflake protocols mentioned here? Or do they require a star topology to mostly be intact?

mjb · on Sept 11, 2020

> One of the interesting things about Raft that doesn't get focused on much is that strictly speaking Raft does not require that all nodes be able to see each other, just that some nodes can see all of the other nodes.

The same is true of Paxos. If you have a common view of participant health, then consensus is much easier than in the general async model, so Raft and Paxos (and Viewstamped Replication, etc) are like this for a reason.

I don't know about SnowFlake, but there is a very interesting tradeoff here between graph connectedness and resilience to the failure of masters (e.g. if the topology is really a star, then you have no real redundancy and you may as well have a single master rather than using consensus).

ryanmarsh · on Sept 12, 2020

I found this paper helpful “Debugging Distributed Systems with Why-Across-Time Provenance”

https://mwhittaker.github.io/publications/wat_SOCC18.pdf

stack_underflow · on Sept 11, 2020

Couldn't message you on linkedin for some reason (looks like they've locked the messaging feature to premium now??) - but just FYI, looks like the twitter account linked on your blog in the footer is owned by someone else.

ahelwer · on Sept 11, 2020

Oh thanks! Looks like another Helwer reclaimed it. I think you can message people on linkedin as part of a connect request maybe?

duutfhhh · on Sept 11, 2020

That's a worthy idea to learn. I've actually found it quite trivial, which means I'm either too smart or too dumb to understand it. That MDP diagram reminded me of the Feynman's path integral model of electrons: the double slit experiment looks as if electrons take all possible, within constraints, paths and the interference image is the sum of them. Except that in the MDP diagram, parameters of the two slits can vary, within some boundaries, and that varying changes the interference image, so the question that MDP answers is "how the interference image can change if we vary the slits parameters within these boundaries"?

hwayne · on Sept 11, 2020

> That's a worthy idea to learn. I've actually found it quite trivial, which means I'm either too smart or too dumb to understand it.

Or that Andrew is very good at explaining things :)