They'd estimated something like an optimistic 4 months for the work.
Two of the more senior developers were assigned the task, and they had to learn TLA+ and formal methods from scratch. (one dev's main complaint was that he was forced to use Eclipse, I don't know if integration has now happened with other IDEs since then).
It took a little bit for them to get up to speed, but they were soon implementing and catching bugs in the proposed designs, and they had all the bugs ironed out within a few weeks.
They then found that going from TLA+ to actual Java code was almost a "fill in the blanks" exercise. The formal model provided almost the exact structure they needed for their final code. Going from model to code took them very little time at all. It was in production, running in shadow mode after not even 2 months, and was fully live not long afterwards.
TLA+ got subsequently used for a number of other features added to the platform.
There had been general push back among developers that learning TLA+ and doing formal verification was only going to slow them down, but we found it was quite to the opposite. TLA+ actually sped up development, with the big bonus of higher confidence of correctness.
There are a few projects to make this more accessible. We have
Jupyter notebooks: https://github.com/kelvich/tlaplus_jupyter
I sort of wrote a vim plugin, but it's nowhere near as sophisticated as the above three. I also recently made public a personal CLI tool (https://github.com/hwayne/tlacli) for running TLC on the command line. It's experimental but I find it really useful.
Nothing like this can be done in a standard programming language (even Scala or Haskell), and even in a dependently-typed language you would need to write some support code in order to embed this logic and work with it within the system. (It wouldn't be hard for sure, since even FOL gives you far more power than any (multi-)modal logic, but it's a different, perhaps more elegant take on things than just working directly within TLA).
Here's a presentation by Eric, a distinguished engineer, and Neha, a principal engineer, on analysing IAM policies (Zelkova?) and Network reachability (Tiros?): https://www.youtube.com/watch?v=x6wsTFnU3eY
This webpage has many more links to blog articles on the topic: https://aws.amazon.com/security/provable-security/
One aspect of TLA+ (and formal specifications in general) that I under appreciated when we wrote this was how useful they are as a communication tool. I've found well-commented TLA+ to be an excellent medium for reviewing and documenting protocols, and having high-bandwidth conversations about them. Complex protocol interactions can be hard to describe clearly in text, but the mix of text and TLA+ (or PlusCal) seems to work well.
Q: Do engineers actually implement AWS systems in TLA+?
A: No, TLA+ is called "exhaustively testable pseudo-code" internally, and is only used for designs.
Q: Which AWS systems use it?
A: The first was DynamoDB, followed by S3, and now there are 8 additional complex systems that use it for designs.
Q: Who uses TLA+ at Amazon?
A: All engineers, from entry level to principal.
I thought it was especially interesting that several engineers at AWS were effectively writing their own model checkers by writing programs that tried to brute force rare combinations of conditions that would lead to system failure. Obviously TLA+ does this better and faster so they switched to that.
> All engineers, from entry level to principal.
I'd phrase that as "some engineers of all levels". It's certainly not true that all engineers at AWS use TLA+ :)
- Pracitcal TLA+
It’s much easier to get running with than other formal methods tools like Isabelle and Coq.
Lamport's TLA+ book, and "Specifying Systems" are both available online (https://lamport.azurewebsites.net/tla/book.html). That one's worth a read too. Lamport's approach to teaching TLA+ is different, and probably more suited to the mathematically-inclined.
If you like spelunking through history, I'd also recommend checking out some of Lamport's papers on TLA. I particularly like "What Good is Temporal Logic?" which lays out his early arguments about why temporal logic is useful for solving some classes of problems. The paper is available here: https://www.microsoft.com/en-us/research/uploads/prod/2016/1...
Link to the book:
A bit more: https://news.ycombinator.com/item?id=9289890
Related from 2015: https://news.ycombinator.com/item?id=9287426
In addition to the books and sites mentioned here, there's also a video course with tutorial content produced by the creatore of TLA+: https://lamport.azurewebsites.net/video/videos.html
Note that no other bugs have been reported for the code since.