
Using TLA+ to Model Cascading Failures - mbellotti
https://medium.com/@bellmar/using-tla-to-model-cascading-failures-5d1ebc5e4c4f
======
tluyben2
I have been modelling parts of our systems in TLA+ for years now, and it really
helps a lot in working out complex scenarios.

~~~
Leace
I'd love to read about that in detail. Are there any publicly available
documents you could share?

~~~
Twirrim
Anecdotally, the team I was on at AWS needed to build a complicated component,
one that, should they get it wrong, would be disastrous for the service.

They'd estimated about 4-6 months for a two-person team, made up of some of the
best engineers in the service and focused entirely on it, to get it written,
tested and out to production.

They decided to use TLA+ to model, despite neither engineer having used it
before. They lost about a week to getting up and running with it (one
engineer's only complaint was how tied in to Eclipse it was), and then spent
the rest of the month working on and modelling the whole task.

It found problems. A whole bunch of them. They fixed the model until finally
TLA+ gave them an all clear.

Then came the coding. Well... that didn't take very long at all. The TLA+
model effectively outlined all the code and methods for them. The actual
programming ended up being almost cookie-cutter simple.

In total, the new and complicated component went from drawing board to tested
and ready for production in about 2 months. Despite having had to learn TLA+,
it ended up taking less time than if they'd not written the TLA+ models in the
first place.

~~~
baq
Lamport claims on his website that Amazon uses TLA+ and has used it for quite
a while now. I mention that to confirm the plausibility of this story.

~~~
Twirrim
Amazon/AWS has published a few white papers about their use of TLA+ and other
Formal Methods:
[http://lamport.azurewebsites.net/tla/formal-methods-amazon.p...](http://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf).

James Hamilton is a fan of TLA+, and talks about its use in Amazon:
[https://perspectives.mvdirona.com/2014/07/challenges-in-
desi...](https://perspectives.mvdirona.com/2014/07/challenges-in-designing-at-
scale-formal-methods-in-building-robust-distributed-systems/)

------
victor106
Can anyone here shed some light on how to incorporate TLA+ into an agile
process?

~~~
pron
Augmenting Agile with Formal Methods:
[https://www.hillelwayne.com/post/augmenting-
agile/](https://www.hillelwayne.com/post/augmenting-agile/)

------
usgroup
Can anyone advise on whether it’s worth modelling general critical systems in
TLA+ where they are not distributed?

In broad strokes, how does one make sure that your real system indeed behaves
like your model?

~~~
pron
> Can anyone advise on whether it’s worth modelling general critical systems
> in TLA+ where they are not distributed?

Absolutely, and not just critical systems. Any system that is either complex
or may have some non-obvious subtleties can benefit from a specification.

> In broad strokes, how does one make sure that your real system indeed
> behaves like your model?

In general, TLA+ can be used to specify very large and very complex systems.
It is currently infeasible to mechanically verify with absolute certainty that
systems of such size conform to a specification, regardless of the tools used.
The only systems that can be verified to such an extreme extent (called
end-to-end verification, meaning that interesting global properties are
verified all the way down to the code, and even to the machine-code level) are
very, very small (no
more than about 10KLOC), and even then require a tremendous amount of effort
by experts.

Specifications should be relatively short and clear. They are therefore useful
for stating your assumptions about the system, and then checking the
consequences of those assumptions. Whether the assumptions are accurate,
approximate or wrong can then be verified by inspection -- this is certainly
feasible and commonly done in practice. There are also relatively cheap
mechanical ways to check that the system conforms to the spec, but not with
absolute certainty. One is called trace checking, and its possible use with
TLA+ is described here:
[https://pron.github.io/files/Trace.pdf](https://pron.github.io/files/Trace.pdf)
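
As a rough illustration of the idea (a sketch only, not the TLC-based approach
the linked paper describes): trace checking records the states a running system
actually passes through and checks that every step is one the spec allows. The
spec below, a bounded counter, and the names `LIMIT`, `init`, `next_step` and
`check_trace` are all invented for this example.

```python
from typing import Dict, List, Optional

State = Dict[str, int]

LIMIT = 3  # hypothetical bound on the counter in this toy spec


def init(s: State) -> bool:
    # Initial-state predicate of the toy spec: the counter starts at zero.
    return s["count"] == 0


def next_step(s: State, t: State) -> bool:
    # Next-state relation: either increment by one while below LIMIT,
    # or reset to zero once LIMIT has been reached.
    inc = s["count"] < LIMIT and t["count"] == s["count"] + 1
    reset = s["count"] == LIMIT and t["count"] == 0
    return inc or reset


def check_trace(trace: List[State]) -> Optional[int]:
    """Return the index of the first state that violates the spec, or None."""
    if not trace:
        return None
    if not init(trace[0]):
        return 0
    for i in range(1, len(trace)):
        prev, cur = trace[i - 1], trace[i]
        # Stuttering steps (no change) are allowed, as in TLA+.
        if prev != cur and not next_step(prev, cur):
            return i
    return None


# A trace as it might be reconstructed from a system's logs (made-up data).
logged = [{"count": 0}, {"count": 1}, {"count": 3}]
print(check_trace(logged))
```

A passing trace only shows that this one recorded execution conforms to the
spec. It says nothing about executions that were never recorded, which is why
the check is cheap but gives no absolute certainty.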

~~~
apta
How common is it to use TLA+ on existing/pre-built systems, to retrospectively
try to find issues or bugs in them?

~~~
pron
It's hard to answer because using TLA+ in general is far from common. But
among users, I would say that _most_ cases are where they have an existing
system. That was certainly how I first used TLA+ (when I realized there could
be deep design issues with a system I worked on; TLA+ showed me that there
indeed were two serious problems, and it also verified that my fix was
correct).

------
tunesmith
It seems TLA+ would be a really great fit for modeling actor systems. Has
anyone found good examples of this, for instance for Akka or Erlang?

~~~
polskibus
Akka has TestKit and a multi-node TestKit, which you can use to test such
systems. If you need to explore the state space, you can use data-driven and/or
combinatorial testing attributes from *unit libraries.

~~~
pron
TLA+ and any kind of testing serve different purposes. I don't know how good
an analogy this is, but tests are more like strength tests on a structure,
while a TLA+ spec is like a blueprint.
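
To make the contrast concrete, here is a minimal sketch of the kind of
exhaustive exploration a model checker performs, written in Python rather than
TLA+. The toy system (two processes that each read a shared counter and then
write back the value plus one) and every name in it are assumptions made up
for this illustration.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional, Tuple

# State: (shared counter, program counter per process, local temp per process)
State = Tuple[int, Tuple[int, ...], Tuple[int, ...]]

INIT: State = (0, (0, 0), (0, 0))


def successors(state: State) -> Iterator[State]:
    """Yield every state reachable by letting one process take one step."""
    counter, pcs, tmps = state
    for p in (0, 1):
        pc, tmp = list(pcs), list(tmps)
        if pcs[p] == 0:
            # Step 1: read the shared counter into a local temp.
            tmp[p] = counter
            pc[p] = 1
            yield (counter, tuple(pc), tuple(tmp))
        elif pcs[p] == 1:
            # Step 2: write the local temp plus one back to the counter.
            pc[p] = 2
            yield (tmps[p] + 1, tuple(pc), tuple(tmp))


def invariant(state: State) -> bool:
    # Once both processes have finished, the counter must equal 2.
    counter, pcs, _ = state
    return pcs != (2, 2) or counter == 2


def find_violation() -> Optional[List[State]]:
    """Breadth-first search over all interleavings; return a shortest
    path to a state that breaks the invariant, or None if there is none."""
    parent: Dict[State, Optional[State]] = {INIT: None}
    queue = deque([INIT])
    while queue:
        s = queue.popleft()
        if not invariant(s):
            path: List[State] = []
            cur: Optional[State] = s
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for t in successors(s):
            if t not in parent:
                parent[t] = s
                queue.append(t)
    return None


for step in find_violation() or []:
    print(step)
```

A test that runs the two processes one after the other sees the counter reach
2 and passes. The exhaustive search also tries the interleaving where both
processes read before either writes, and returns that five-state trace as a
counterexample ending with the counter at 1.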

