Hacker News new | past | comments | ask | show | jobs | submit login
TLA+ model checking made symbolic (acolyer.org)
197 points by feross 54 days ago | hide | past | web | favorite | 51 comments

Is anybody using TLA+ in production outside of academia? What for? I've always wanted to, but I'm not familiar enough with it to always recognize when it'd be the right tool to reach for.

I primarily code in Elixir and Erlang, and have used property testing to perform some level of model checking when working with state machines, and I suspect TLA+ could be helpful for formalizing that a little bit, but it isn't a theory I've ever put into practise.

I used it in the context of Linux kernel development to model/verify various algorithms. I gave a presentation last year at Linux Plumbers:


Slides 4 and 28 list actual bugs found by such modelling.

I do find TLA+ (or formal modelling in general) pretty useful for real world cases, only that, unfortunately, I don't always have the time.

This presentation is exactly the sort of thing I'm interested in, thank you!

We use TLA+. There's nothing particularly magical about TLA+. There are a variety of modeling languages and model checking toolkits out there. I've yet to encounter someone in industries that have commonly adopted model definition & checking who uses TLA+. It's popularity seems highest among folks in newer SaaS/IT arenas than elsewhere.

More often I run into folks using SCADE's model checker. Tied for a distant 2nd, I'd say I've run into more folks who work with NuSMV, CADP, and mCRL2 than TLA+. This could also be mostly sample bias, and it's worth noting that the folks who use any of these things are in the very extreme minority of architects & developers that I speak to in automotive, med. tech., industrial automation, etc.

Though, like I said, we use TLA+. Not for any particularly good reason than it was a sufficient tool among the variety of comparable tools for what we needed to do, but I expect that if we had to go back and do it all over again we'd have selected something else. The specification language has some annoying warts, the checking performance isn't comparatively very good, and the provided IDE works in almost exactly the opposite way any of us wish it would work. To the point that one of our engineers figured out how to wire it all up through Emacs so we could stop monkeying around with TLA Toolbox for 90% of the work. Of course, everything has problems, so TLA+ isn't notoriously different here. We've just used it in anger more than other things.

For whatever it's worth, I also used to be an Erlang developer and used to play around a bit with McErlang, which I think might be a dead project now. However, there's an active project called Concuerror (https://concuerror.com/) which kind of covers the same domain. I don't know how well it works or what its level of maturity is, but it is very interesting in this one very important regard... it's model checking against your actual program implementation. That's an enormous and fantastic difference. Model conformance of concrete implementation to your design (and checked) model is an enormous problem, and tools which can close or remove that gap can be enormously valuable. What I don't know is the level of sophistication of Concuerror, nor if it is able to check for the kinds of properties you care about in your systems.

TLA+'s secret power is refinement; the entire logic is built around the abstraction/refinement relation. One of the best ways to use TLA+ is to specify you system in two or three levels of detail and check that they are, indeed, refinements. If you just use TLA+ for model-checking and don't use refinement mappings that are possible due to the language's elegance, you're giving up on most of its power, and if that's the case, you might as well use weaker specification languages like Promella and NuSMV. But I think it's worthwhile to learn how to use refinements and enjoy the full power of TLA+ rather than just as a language for model-checking. Unlike Promella and NuSMV, TLA+ is not designed as a frontend language for a model-checker. It's a language primarily designed to allow you to think about and understand your system in interesting ways, and it also has a model checker. It's designed to be read and discussed (hence the emphasis on pretty-printing) much more than written and edited. If you're using it anyway, make the full use of it.

I'm now _really_ curious as to how many people who use and value TLA+ have actually used refinement. I've used it, you've used it, Lamport and (probs) Merz and Markus have used it, most other people I've talked to haven't. Probably time for a community poll or something.

Technically it's impossible to use TLA+ without using refinement, as the entire TLA logic is based on it. Checking Machine ⇒ □P is checking that Machine refines □P. But, of course, the real power -- and what you and I both meant -- comes from writing machine formulas on both sides of the implication sign.

I don't think many of those who use TLA+ use refinement between two machine formulas, but I think that's partly because they still consider TLA+ as a frontend language for TLC, which is unfortunate, because if all you want (or, rather, think you want) is a model-checker, there are probably better ones. In fact, the little use of refinement is just a symptom of this bigger issue. I don't think refinement is always useful, but I do think that everyone who uses TLA+ should use its power for analysis rather than just mundane model-checking. I admit, TLC is so enticing, that once you use it, TLA+ can feel secondary to it. Of course, I'd be happy if people used model-checkers more, whether TLC, Spin, NuSMV, Simulink or others, but I'd be even happier if people used TLA+ for its primary purpose, perhaps after being enticed by TLC: thinking about system design. I guess the problem is exactly what Lamport pointed out: programmers (including myself) don't like to think, especially when we can use automation. To be honest, I don't think I would have come to TLA+ if it weren't for TLC. So, in a sense, TLC both pulls people to TLA+ and obscures its main purpose.

In the TLA+ workshop I imagine one day giving to my colleagues, I start with refinement. I would also start without TLC at all; just showing how TLA+ helps you think -- its main purpose and advantage. When writing specifications, I now always start with refinement, first writing how the system would behave at a very high-level, and only then refining a specific algorithm.

(If anyone reads this without having read the article, and doesn't know what TLC is, it's a model-checker for TLA+ that's included in the TLA+ Toolbox)

I've seen plenty of people use TLA+ to understand systems better without writing refinements. I also think you're in a good position to show comparative models between TLA+ and Spin or NuSMV, to show that the other two are more than adequate if you're not using refinement. I know Zave tried to do a Spin vs Alloy comparison and found Spin really frustrating.

Why exactly is TLA+ better suited to thinking about system design compared to say mCRL2, SCADE, or Simulink? Especially relative to the latter two since they provide a paved road to going from your system model to your system implementation, which TLA+ makes no attempt to do.

I don't know mCRL2 at all, but SCADE and Simulink are niche tools (excellent at what they're for). TLA+ is completely general-purpose because it can describe a system at arbitrary levels of abstraction (and link the levels). You could hypothetically describe the same system at the level of logic gates or at the level of human interaction. Now, it is not true that TLA+ makes no attempt to directly link to the implementation, but it makes no particular attempt to do so. Its ability to do so stems directly from its power to describe a system at truly arbitrary levels. You want to describe your system at your code level so that you can verify your actual code? You can do that. But here's where things get interesting. If you want to verify a specification using any sound tool -- be it a model-checker or deductive proof -- the size of the specification becomes very limited. We can verify about 3000-6000 lines of a specification. It can be this number of lines of a low-level specification, i.e. code, or of a high-level specification. So this means that if we want to verify our actual implementation, our program must be tiny. But TLA+ gives you the option to specify and verify, say, a 2000-line high-level specification of a 2MLOC program to find problems at that level, and that's what people in general-purpose software need most.

If your tool -- any tool -- links directly to the implementation, we know that the verified implementation will be very limited in size. It is TLA+'s ability to choose arbitrary levels of detail that gives it the power to help design large system. Such designs, however, cannot be formally (i.e. mechanically) linked to the implementation. If they could, we'd know how to fully verify large systems, but we don't.

Everything you said about TLA+ here also applies to the other tools I mentioned. You'd just work with them differently to get to similar ends, but with the added benefit of you can actually directly use the output of your efforts. I also don't understand how you arrived at SCADE & Simulink being niche tools, but somehow TLA+ is general purpose?

There must be at least 2 orders of magnitude more Simulink users doing model-based design & development than there are people who use TLA+ for pretty much anything. Simulink has limitations, but being comparatively niche relative to something like TLA+ isn't one of them.

As an aside, not only does TLA+ not make an effort to provide a means of mechanically linking one's model to one's program, but when the question was asked of one of its progenitors how a developer might do this they were literally scoffed at for the suggestion.

You're not describing anything to me that I'm not already aware of insofar as TLA+ is concerned. Having also used several other similar or adjacent tools, and being in constant contact with folks who use them for avionics systems, medical devices, large scale industrial automation, etc., I'm still inclined to say that TLA+ is fine. It's unnecessarily annoying in a bunch of dimensions. I am skeptical that if TLA+ wasn't descendent from one of the distributed systems godfathers, and wasn't associated with AWS in the software developer zeitgeist, that there'd be as much intrigue & interest around it.

Also, I begrudge having had to come to the defense of SCADE and Simulink. I spend a non-trivial amount of my week essentially arguing against them in some respect, but they deserve their due.

> You'd just work with them differently to get to the same ends, but with the added benefit of you can actually directly use the output of your efforts.

With TLA+ you can also directly use the output of your efforts. But if you're talking about compilation to some executable, that, unfortunately, requires a certain level of detail above which you cannot rise due to computational complexity concerns. So if you want that, you'd have to use TLA+ at the same level as other tools that emit code.

> There must be at least 2 orders of magnitude more Simulink users doing model-based design & development than there are people who use TLA+ for pretty much anything. Simulink has limitations, but being comparatively niche relative to something like TLA+ isn't one of them.

Oh, of course! Those are excellent tools, but still niche (they target mostly embedded and/or real-time). You cannot use them to design and verify a 2MLOC system. I wasn't talking about popularity, but about target domains.

> As an aside, not only does TLA+ not make an effort to provide a means of mechanically linking one's model to one's program, but when the question was asked of one of its progenitors how a developer might do this they were literally scoffed at for the suggestion.

First of all, as I said, it does not make a particular effort, because it's not necessary. People have used TLA+ to verify actual C, Java and Go code (that I know of). The problem is that once you do that you are bound by the scaling limitation of all tools that operate at that level. TLA+ is relatively special in that it can rise beyond that limitation, and that is where most of its users find value. Using TLA+ at the code level is just not using it to its full potential, and so misses the point. Unfortunately, specifying and verifying at that level can only be done by abandoning what we call "end-to-end", i.e. verification all the way to the code.

> Also, I begrudge having had to come to the defense of SCADE and Simulink. I spend a non-trivial amount of my week essentially arguing against them in some respects, but they deserve their due.

Of course, and I'm sorry if it sounded as if I was suggesting that TLA+ is better than them at what they do. I, myself, used Esterel decades before first learning of TLA+. TLA+ is better at other things -- things that are more applicable to a much wider application domain. My point is that due to fundamental limitations -- we can only verify a specification of a certain size -- there is an inherent tradeoff between scale and application domain and the ability to generate and executable.

If you know good introductions to those tools, I'd be happy to look! Most of my experience is in things like PRISM, Alloy, TLA+, etc, and I don't have experience with those three. At least on a quick search, one difference between TLA+ and SCADE and Simulink is it's free and open source. mCRL2 is also free and open source, looks like, so that's probably the easiest for me to start looking at.

Concuerror is a very cool tool! I've been very impressed with some of the bugs I've seen it detect. It's definitely a tool I'd like to make more use of.

Thanks for the insightful comment, it's given me a lot of things to research!

Did you all ever open source your emacs tooling? I’d like to see that.

We didn't. It's on our list of things to open source and write about though, so we will when time permits. Probably won't be until after the new year, as we're focused on preparing for CES.

Did you also wire up trace expression evaluation in Emacs?

Yes, working on distributed database. So far it's quite useful -- it is possible to quickly check correctness of changes that you intend to make in few crucial places in code. It took about a week of learning/experimenting to write a useful spec. Also resulting state machine is not that much different from what you can expect in networking code working with epoll().

I use it in production at my job, I primarily use it to catch sneaky bugs in designs before implementing them and to produce something tangible to talk over with other engineers once I've settled on a design.

It's worth learning and using IMHO.

I reach for TLA+ any time I want to temporally model any complicated system and make sure of certain invariants, so usually any distributed system I build. The last really complicated system I used it on was to rewrite a broken message store we had implemented on top of memcached.

I have used it when working on some subtle/complicated system.

BTW, property-based testing and model-checking are not really similar. Model-checking is any automated technique to find an invalidating assignment to some formula or declare there is none (so SAT solvers do model-checking). In the case of programs, model checkers automatically determine whether a program satisfies some property. Unlike property-based tests, they do not run the program on all inputs. In fact, the program is never run at all in the ordinary sense. Even TLC, an explicit state model checker for TLA+, that tries every possible state can check uncountably many executions, each of infinite length in a matter of seconds, where "running" even one of those would take infinite time.

It is sometimes easy to think of a model checker as conceptually running the program on every input, but it's also sometimes important to know that model checkers don't actually do that.

I understand what you're saying, but I think that it's possible to approximate model checking with property testing, by defining your model's invariants as properties, and then generating a list of symbolic commands to run against your model.

Like you said, you won't be testing all of the possible inputs, but it can still be a very effective way to detect the kinds of bugs that would otherwise be difficult to discover with more traditional testing methods.

Here's an example of what I'm talking about: http://propertesting.com/book_stateful_properties.html

Randomized testing is very good, but it is so fundamentally different from model checking that it's not an approximation in any meaningful sense (except in the sense that both are automated methods that can find bugs). Again, even an explicit-state model checker can quickly check infinite executions. An interesting hybrid between the two is concolic testing (see https://youtu.be/MDzRV3OQtyQ which quickly shows the difference in power between concolic testing and randomized testing). Concolic testing is one of the most promising areas in formal verification these days.

Also, the use of the term "non-formal" in the article you linked is incorrect. If it runs on the computer it is, by definition, formal. The author meant "sound" or "exhaustive" (although, while all model checkers are exhaustive, many -- like the one in this article -- don't actually try each input; they're exhaustiveness implies soundness, not a particular brute-force algorithm. How well they can actually beat brute force is, however, another matter...).

BTW, it is possible to use the TLC model checker for randomized tests of TLA+ specifications in a way similar to randomized testing of "stateful properties", as checking random walks of the state graph.

Thank you for clarifying what you mean!

I was using the same terminlogy as the article, and treating my property tests as informal forms of model checking due to their being unsound, but I can see how that isn't exactly correct.

I've come across this repo which use TLA+ to validate data migrations. https://github.com/Shopify/ghostferry

This is really cool! I'm continually blown away by the quality of some of the projects Shopify open sources.

Some of us (read... one of us, and not me, yet) uses TLA+ when theorizing changes to parts of M3DB, which is a distributed time series database.

You can see the specs here, the current TLA+ models the consistency model of data being persisted to disk (flush, snapshot), there was at some point TLA+ for describing the quorum writes/reads along with the background tick event loop but that must be elsewhere now.


Azure Cosmos DB does. There is a good video about it: https://www.youtube.com/watch?v=kYX6UrY_ooA

And here is the Github of their models: https://github.com/Azure/azure-cosmos-tla

I've written some models for "personal use and validation", and I keep it in my toolbelt (with Alloy) in case my thinking is not clear about something

Could you briefly describe some of the models you played with? It seems cool that someone just specified models and verifies them in their spare time.

You may have seen this blog already, but the author of "Practical TLA+" has a really fascinating blog with some fun toy examples. I'm a fan of this article where they model A Link to the Past with Alloy, to prove that dungeons are "winnable" [0].

The rest of the blog is really interesting too.

[0] https://www.hillelwayne.com/post/alloy-randomizer/

Feel free to throw an email my way (it's on the website) if you want to chat a bit more about potential use cases.

My spare time is kind of... weird. They were related to concurrency in an actor system, I validated a design and implementation (can’t share much more). I recommend checking Hillel’s blog as suggested, he has excellent examples (and his book is a good starting point for TLA+ and Pluscal as well)

I'm starting to use TLA+ to verify lock-free algorithms. The tool is essential when all your code fits on the screen, yet you can't comprehend what is going on and miss a race here and there.

So far it's more a personal interest, but I'm advocating for spending more time verifying critical pieces of our code base like MPMC containers, synchronization primitives, etc.

PS. Doing C/C++/C# in gamedev.

I've used it twice "in anger". About a year ago I modelled the LoRaWAN state machine to ensure that I'd caught all of the corner cases (packets that get missed/corrupted). I iterated on my FSM until it no longer had deadlocks, and then basically transcribed it to C and it all worked. I was impressed.

Right now, I'm doing a similar thing. I've got a Bluetooth LE device that has a bit of a request/response layer built on top of the standard characteristic read/write stuff (it's a BLE Iridium modem, if you're curious). I didn't make the device, I'm just writing mobile apps that interact with it. There's a lot going on here:

- apps transition through different states (foreground, background, suspended). There's different things we can do when the app is in different states (e.g. there isn't really a good way to initiate a BLE characteristic read on the iOS side when the app is suspended)

- the Bluetooth hardware transitions through multiple states (off, on, not authorized)

- the BLE connection transitions through multiple states (disconnected, initial connection, services & characteristics discovered, etc)

- the stateful protocol involves writing to a characteristic and receiving either 1 or 2 messages back, later on. Transmitting to the Iridium constellation can take a minute if the modem has to reconnect to the satellites. This transmission may fail (which we will be told about), but a failed transmission remains queued on the device and may succeed later.

- The BLE-Iridium devices notifies the mobile device when something happens via BLE Notifications. Notifications are unreliable, so there's a chance we could miss receiving a notification the device has sent us.

This has already shaken out one potential show stopper: if the app is suspended, it can be woken up by a BLE notification to let us know there's new data available; however, if we miss that notification (it's not a reliable notification), there's no good way to poll the device to see if there's data available when the app is suspended. (Using only the available BLE APIs)

The TLA+ model basically pointed out that the (finite sized) message buffer can grow without bounds if notifications are missed over and over (e.g. a super shitty RF environment). I've got a strategy that might work for polling it periodically, but I don't know that I would have realized this failure path without modelling it.

I may also end up pushing back to the vendor to see if they can implement BLE Indications (reliable) as well as BLE Notifications (unreliable). If a BLE Indication fails, the BLE device should disconnect, which will trigger CoreBluetooth stuff that allows us to reconnect and poll without having to do some hacky timer-related stuff. Can't rely on sending a silent push notification either, since the device will be used in remote areas where there's likely no data coverage.

> The TLA+ model basically pointed out that the (finite sized) message buffer can grow without bounds if notifications are missed over and over (e.g. a super shitty RF environment). I've got a strategy that might work for polling it periodically, but I don't know that I would have realized this failure path without modelling it.

Obviously I don't know anything about your use case, and I apologize if this comes across as unsolicited advice, but this sounds like a pretty textbook example of a situation where load shedding is your only option -- even if you manage to figure out the polling problems.

Oh, the device itself keeps a finite-sized mailbox and deletes the oldest message if it's full.

The concern is that one of the deleted messages might have been important (e.g. informing us that a transmission failed, or informing us that Search & Rescue got the message and is sending a helicopter). The polling is a backup plan that at least gives the phone feedback if it's not working (a failed read informs the phone that the read failed; a failed notification doesn't tell either side that something failed). We can still obviously lose messages if enough of them come in between polling intervals, but it's better than potentially losing messages indefinitely without knowing about it.

The other nice thing is that we can keep metrics for how often the polling process actually finds things that we weren't successfully notified about. That'll either give us leverage to encourage the vendor to switch to indications instead of notifications, or encourage us to find a new vendor :)

Edit: and yes, I 100% appreciate the feedback!

That makes a lot of sense! Thanks for elaborating :)

If you don't mind me asking, what are you working on / what is your background? It sounds like you're doing a ton of really interesting engineering, involving a lot of moving parts.

I don't mind at all! It'll be like a tiny advertisement for my consulting company.

For undergrad, my school didn't have a Computer Engineering program, so I did EE and a special CS program they had for engineers (all of the CS and Math credits of a 4-year CS degree, none of the arts electives). In EE, I focused primarily on small-signal stuff (DSP, Digital Communications, Control Systems, HDL, etc), and in CS I basically drank from the fire hose (my 4th year electives were Security, Compiler Design, Advanced Algorithms, and OS Concepts 2). I went and worked for a few years, first at a startup doing electronic design software, and then at a "hot web company". Got a bit bored, went back and did an M.Sc. in Comp Sci, focused on Distributed Systems (my thesis was basically New Relic before that was a thing...)

About 18 months into grad school, my funding ran out. Coincidentally, a friend showed up and asked me if I wanted to do some backend programming for a side gig he'd picked up. That turned into he and I going full-time running a web/mobile company for a few years. Made decent money at that, rode the "mobile is the new hotness" wave, but wasn't feeling super fulfilled. He ended up moving across the country, and we broke up the company. With my newfound freedom, I started shifting away from web/mobile stuff as much as I could and focusing more on lower-level type work (put that EE degree to use!).

I don't really have a portfolio put together, but here's a few examples of the cool projects; remember that this is like a person's social media feed and it's the highlights. There's been boring projects as well, and there's been a month here and there with no income at all.

- GNSS satellite simulator for testing GNSS receivers. We simulated the orbits of entire constellations of GNSS satellites with ~1m precision, simulated the signals they were transmitting and the delays those signals would experience (due to distance and atmospheric interference), and put the combined output into a Software-Defined Radio. Good enough to convince the receivers I tested with! This had a bunch of C code for doing the simulation and signal generation, and a Python scripting interface on top.

- Live animation/Motion Capture. Took a handful of 1st-gen Kinects, put them on tripods, and turned them into an "all aspect ratio" motion capture system that didn't require special suits/rooms/tags. After putting them around the room, they would auto-calibrate themselves to figure out where they were relative to each other, and then spit out a stream of joint positions in a world frame. This got streamed into... maybe it was 3DS Max? Either way, the joint positions and angles were mapped onto a live model.

- The ill-fated Peachy Printer (if you missed the drama, I'm pretty sure it's still online somewhere). UV laser- and resin-based dirt cheap 3D printer. I ended up in the right place at the right time and ended up designing the digital version of the main PCB (first version plugged into your sound card, my version was USB).

- A fuel-monitoring system that keeps track of tank farms all over Western Canada. I didn't make this hardware, but significantly rewrote a bunch of the software for performance reasons, and added robust networking to it. From the central office, you can see how much fuel is in every tank across 3 provinces and make scheduling/routing decisions. (These are the 80,000L tanks you see on the side of the road in rural areas).

- A giant hexcopter for agriculture. This is still a work in progress but testing is suspended until spring when there's more than just a thick layer of snow on the ground. I mostly worked on the mechanical design and the flight controller, with a bit of input here and there for the payload and other electronics.

- A Bluetooth personal safety device. When I joined the project, a 3rd-party had designed and manufactured the hardware; a different 3rd-party had designed the mobile apps. Everything was pretty flaky. I did a bit of firmware work on the device itself, and a significant rewrite of the Android Bluetooth code to make it drastically more reliable. This is the same company that is now integrating with the BLE-Iridium device to be able to provide coverage out of cell range. The coolest part of this project? There are people who are alive because of this project, and who likely would have died if the code was bad.

- A clean-sheet implementation of the LoRaWAN protocol. A client looked at the reference library provided by Semtech and felt they couldn't rely on the code. I made up the TLA+ model based on the spec (and uhh deviations from the spec that we learned about in certain hardware), and iterated on the spec until it seemed robust. "Ported" it from TLA+ to C, and it worked like a hot-damn.

Thanks for asking about this, by the way... I'm going to save this post as a bit of a resume and iterate on it to figure out what I've missed!

p.s. I love the website.

Amazon uses it, even made publications for it.

Just a nitpick: Property testing is nowhere near the level of assurance that model checking provides.

Absolutely! Which is why I'd love to incorporate actual model checking into my work. I said that I make use of some level of model checking, but what I really meant was that I make use of a poor approximation of it.

Edit: I was convinced later in the thread that this isn't a fair comparison :)

iterally DDG'ing "tla industry" gets you stuff eg. https://en.wikipedia.org/wiki/TLA%2B#Industry_use.

Another 20 seconds gets you https://lamport.azurewebsites.net/tla/industrial-use.html

Sure, but not all of us are building products at the scale of Amazon and Microsoft. When you're shipping something like DynamoDB, the argument for TLA+ is obvious. What's harder to come across are the people finding ways to make use of technologies like this on a smaller scale, and I think that's one place where Hacker News really shines as a way of connecting people.

At least personally, working at a small and understaffed startup, I'm not always going to find the time to put together big case studies on interesting things I do, but it's a lot easier to share a sentence or two on a casual Hacker News thread. I wouldn't be surprised if it's the same for other people.

That's not what was asked for. However I understand where you're coming from and am with you. I wish I had the chance to use TLA, or all the other goodies that seem ruled out by the results now and damn the bugs! mindset.

Fair :) My question was meant with more of a conversational slant than an objective one, but I can see how that tone gets lost over text. i.e. It's obvious that some people use TLA+, but I'm interested in what specifically the people here are using it for.

I've seen it used in RabbitMq, but I also wonder if this is used in the real world or just in academia/few technologies.

at the very least it's worth it to watch the video courses by Lamport himself.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact