Hacker News
Applying TLA+ in cloud systems [video] (youtube.com)
89 points by zdw on Dec 15, 2020 | 14 comments



Looks like a great talk. I'm looking forward to watching the whole thing later.

Jumping forward to the TLA+ section, I very much like the point about how TLA+ enabled precise communication between teams. Writing about distributed algorithms is really hard to do well, and the precision of TLA+ makes it much easier to communicate points in a clear way. This is true of math in general, of course, but my experience has been that many folks are more willing to work through TLA+ code than work through written formal math (and, once you learn it, it's much more approachable in this domain).

We've been using TLA+ at Amazon for a long time (https://cacm.acm.org/magazines/2015/4/184701-how-amazon-web-...), and of course its inventor, Leslie Lamport, works at Microsoft Research. Despite that, it's still quite a small community, so I'm always excited when it comes up.


This was disappointing. It is a list of products where they used TLA+ and a marketing-level description of how TLA+ helped. The first mention of TLA+ doesn't come until almost halfway through.


To those more familiar with TLA+: is it worth looking into when one is not building distributed systems? I'm really drawn to the formal proof aspect, as that seems to be the major benefit, but it might be overkill for my use case, which is closer to providing developer-oriented tooling and supporting cloud infrastructure.


I would say yes. There is nothing that inherently connects TLA+ to distributed systems. It's just that 1. it's general enough to deal with nondeterminism and concurrency that, unlike some other formalisms, it can be easily and naturally used to specify them, and 2. distributed systems and concurrency are a very common example of subtle algorithms that engineers often encounter and are tricky to get right without help. The combination of these two aspects, as well as the fact that the inventor of TLA+, Leslie Lamport, is very well-known for his work in distributed systems, led to TLA+ being popular in that field.

The level of trickiness/subtlety of the problems you encounter can determine how much benefit you'll see from TLA+, although I believe people often underestimate it (in a way, all bugs are a result of programmers underestimating the complexity of a problem).

TLA+ is relatively easy to learn compared to a new programming language, and you can become reasonably proficient in it with about two weeks of self-study. The biggest difficulty is unlearning some habits from programming and internalising that, while writing a mathematical specification shares some superficial similarities with programming, the two activities are different in most ways (and fun in very different ways).
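For a flavour of what a spec looks like, here is a toy sketch (module, variable, and action names are mine, not from any real system) of two threads each doing a non-atomic read-increment-write on a shared counter. TLC, the TLA+ model checker, finds the classic lost-update interleaving when checking the invariant at the bottom:

```tla
---- MODULE Counter ----
EXTENDS Naturals

VARIABLES x, t1, t2, pc1, pc2

Init == x = 0 /\ t1 = 0 /\ t2 = 0 /\ pc1 = "read" /\ pc2 = "read"

\* Each thread reads the counter into a local, then writes back local + 1.
Read1  == pc1 = "read"  /\ t1' = x     /\ pc1' = "write" /\ UNCHANGED <<x, t2, pc2>>
Write1 == pc1 = "write" /\ x' = t1 + 1 /\ pc1' = "done"  /\ UNCHANGED <<t1, t2, pc2>>
Read2  == pc2 = "read"  /\ t2' = x     /\ pc2' = "write" /\ UNCHANGED <<x, t1, pc1>>
Write2 == pc2 = "write" /\ x' = t2 + 1 /\ pc2' = "done"  /\ UNCHANGED <<t1, t2, pc1>>

\* Allow stuttering once both threads finish, so TLC doesn't report a deadlock.
Done == pc1 = "done" /\ pc2 = "done" /\ UNCHANGED <<x, t1, t2, pc1, pc2>>

Next == Read1 \/ Write1 \/ Read2 \/ Write2 \/ Done

\* Invariant to check with TLC: both reads can happen before either write,
\* leaving x = 1 at the end, so this is violated.
NoLostUpdate == (pc1 = "done" /\ pc2 = "done") => x = 2
====
```

Note how the spec is just a predicate over states and pairs of states, not instructions to execute; that shift is the unlearning mentioned above.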


> The level of trickiness/subtlety of the problems you encounter can determine how much benefit you'll see from TLA+, although I believe people often underestimate it (in a way, all bugs are a result of programmers underestimating the complexity of a problem).

Yep, and a good example is when TLA+ was applied to some AWS services [0]: it exposed bugs that were 35 states deep. Finding such a bug in production would most likely require very good tracing; having a specification catch it before you've even written the code is invaluable.

> 2. distributed systems and concurrency are a very common example of subtle algorithms that engineers often encounter and are tricky to get right without help.

I would posit that distributed systems are typically a category of concurrent systems, just at a different level of abstraction (threads vs. nodes).

[0] - https://en.wikipedia.org/wiki/TLA%2B#Industry_use


> I would posit that distributed systems are typically a category of concurrent systems

Absolutely, and so are interactive systems, which are distributed/concurrent between the computer and the human user. In TLA+ all of these are described in the same way, but colloquially, engineers think of them differently, and there are actually good reasons for that, too. E.g. virtually no concurrent algorithms (for a single computer) account for a failure of a single core and its associated L1 cache, but all distributed algorithms do.


> as well as the fact that the inventor of TLA+, Leslie Lamport, is very well-known for his work in distributed systems, led to TLA+

But it didn't lead him to come up with decent distributed algorithms. Paxos is a negative example here: Lamport claimed that formalization naturally derives such algorithms, but in reality it was many years and many iterations on the idea, from many different researchers and engineers, that led to something actually useful in practice, which the original Paxos wasn't, and formalization was by far the least important part of that.


I'm not sure how you're getting this. In what way is Paxos a negative example? Raft, which is also a take on Paxos, was formalised in TLA+ by its inventor, as were many other variations on Paxos; in fact, it is hard to derive any non-trivial distributed algorithm, based on Paxos or otherwise, without formalisation. This is true for most subtle algorithms, and distributed algorithms are prime examples. Algorithms that were not formalised at inception were often found to be incorrect once formalised; Chord, famously thought to have been "proven" correct without formalisation, is one example [1]. Not to mention that this work won Lamport the Turing Award.

[1]: https://brooker.co.za/blog/2014/03/08/model-checking.html


As others have said, TLA+ can be very useful outside distributed systems. For example, I'm currently working on a virtualization-related low-level systems paper (likely to be submitted to VEE) where we use TLA+ to reason about the safety of some new syscalls and their related programming patterns. Reordering and concurrency of interactions between processes and the kernel can lead to many of the same challenges as distributed systems.

Many people have also used TLA+ for business logic reasoning, but I haven't yet tried it in that domain (I tend to just do state machine stuff, and sometimes use Alloy).


You can look into TLA+ without any specific reason; it's relatively easy to grasp and doesn't take much time. And it isn't actually that popular in distributed systems, as it demands a software development process that most real-world distributed systems can't benefit from; the only examples are literally distributed cloud services built at megacorps, no less, like in the video. Mostly it's people who are into formal proof who play with it, regardless of whether there are actual benefits. I don't enjoy it and was never able to justify using it on any real project; I could always find more beneficial ways to spend that time. In the rare cases where I really needed to make sure of something, I just covered the whole state space with tests and simulations. Maybe if it weren't written in Java and all the tooling were some portable C that's easy to integrate anywhere, things could be different, but as it is I can't even accept having a JVM everywhere.


I've used it for small-scale concurrent systems and it's been useful. I've never used it for large distributed systems as I don't build or work on those.

Most of the time I choose built-ins that largely handle the normal concurrency problems (once you have a concurrent queue, if you never share data otherwise, a lot of problems are just gone). But in embedded systems that's not always an option (either you have to build it yourself or the structure of the hardware doesn't really lend itself to it). Sharing data across co-processors can lead to subtle, hard-to-detect issues. Reasoning about how they communicate via TLA+ (and other systems) can help make these issues tractable and repairable (or prevent them if applied in advance).
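To illustrate, here is a toy sketch (all names and values are mine, not from any real system) of the classic flag-plus-data handoff between two cores, where the writer's stores can reach shared memory in either order because no barrier is modelled. TLC finds the stale read:

```tla
---- MODULE SharedFlag ----
EXTENDS Integers

VARIABLES data, flag, pending, seen

\* seen = -1 means the reader hasn't read yet.
Init == data = 0 /\ flag = FALSE /\ pending = {"data", "flag"} /\ seen = -1

\* With no memory barrier, the writer's two stores can hit shared
\* memory in either order, so both are enabled while pending.
StoreData == "data" \in pending /\ data' = 42
             /\ pending' = pending \ {"data"} /\ UNCHANGED <<flag, seen>>
StoreFlag == "flag" \in pending /\ flag' = TRUE
             /\ pending' = pending \ {"flag"} /\ UNCHANGED <<data, seen>>

\* The reader polls the flag, then reads the data once.
Read == flag /\ seen = -1 /\ seen' = data /\ UNCHANGED <<data, flag, pending>>

\* Stutter once everything is finished, so TLC doesn't report a deadlock.
Done == pending = {} /\ seen /= -1 /\ UNCHANGED <<data, flag, pending, seen>>

Next == StoreData \/ StoreFlag \/ Read \/ Done

\* TLC finds a violation via StoreFlag then Read: the reader sees
\* flag = TRUE but stale data = 0.
SeenFresh == seen /= -1 => seen = 42
====
```

Writing the two stores as independently enabled actions is a crude way to model reordering, but it's enough to make the missing-barrier bug visible in a two-step counterexample.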

And learning to reason using TLA+ doesn't necessarily mean having to use the full toolkit and run models. Just describing the system at this higher level can make issues in the design apparent (this has been my experience, at least).


For instance, see this recent post about using TLA+ to find a concurrency bug in glibc's implementation of pthreads:

https://probablydance.com/2020/10/31/using-tla-in-the-real-w...


Thank you so much for posting this; I lost my notes and had been looking for this TLA+ material for months.


Distributed systems engineers sure seem to love their Three Letter Acronyms (TLA)



